Draft version March 17, 2010 

Preprint typeset using LaTeX style emulateapj v. 11/10/09 



AUTOMATIC UNSUPERVISED CLASSIFICATION OF ALL SDSS/DR7 GALAXY SPECTRA 
J. Sanchez Almeida^1,2, J. A. L. Aguerri^1,2, C. Munoz-Tunon^1,2, and A. de Vicente^1,2 

ABSTRACT 

Using the k-means cluster analysis algorithm, we carry out an unsupervised classification of all 
galaxy spectra in the seventh and final Sloan Digital Sky Survey data release (SDSS/DR7). Except 
for the shift to restframe wavelengths, and the normalization to the g-band flux, no manipulation is 
applied to the original spectra. The algorithm guarantees that galaxies with similar spectra belong 
to the same class. We find that 99% of the galaxies can be assigned to only 17 major classes, with 
11 additional minor classes including the remaining 1%. The classification is not unique since many 
galaxies appear in between classes; however, our rendering of the algorithm overcomes this weakness 
with a tool to identify borderline galaxies. Each class is characterized by a template spectrum, which 
is the average of all the spectra of the galaxies in the class. These low noise template spectra vary 
smoothly and continuously along a sequence labeled from 0 to 27, from the reddest class to the 
bluest class. Our Automatic Spectroscopic K-means-based (ASK) classification separates galaxies in 
colors, with classes characteristic of the red sequence, the blue cloud, as well as the green valley. 
When red sequence galaxies and green valley galaxies present emission lines, they are characteristic 
of AGN activity. Blue galaxy classes have emission lines corresponding to star formation regions. 
We find the expected correlation between spectroscopic class and Hubble type, but this relationship 
exhibits a high intrinsic scatter. Several potential uses of the ASK classification are identified and 
sketched, including fast determination of physical properties by interpolation, classes as templates in 
redshift determinations, and target selection in follow-up works (we find classes of Seyfert galaxies, 
green valley galaxies, as well as a significant number of outliers). The ASK classification is publicly 
accessible through various websites. 

Subject headings: catalogs - methods: statistical - galaxies: evolution - galaxies: fundamental pa- 
rameters - galaxies: statistics 



1. INTRODUCTION 

The nebulae^3 are so numerous that they cannot be
studied individually. Therefore, it is necessary to know
whether a fair sample can be assembled from the most
conspicuous objects and, if so, the size of the sample
required (Hubble 1936, Chapter II). Even though these
arguments are from the outset of extragalactic astronomy,
and they refer to the morphological classification
of galaxies, the reasons put forward by Hubble remain
valid today. The need to sort out and simplify justifies all
recent efforts to classify the spectra of galaxies (§ 1.1),
including the present work. Such attempts are now more
significant than ever since we have never had the large
catalogs of galaxy spectra available today.

The seventh and final Sloan Digital Sky Survey data
release (SDSS/DR7) provides spectra of some 930000
galaxies (Stoughton et al. 2002; Abazajian et al. 2009,
and also the SDSS website^4). This uniform data set
offers a unique opportunity to comprehensively classify
the different spectra existing among nearby galaxies. Our
paper presents the results of an unsupervised spectral
classification of the entire catalog. Unsupervised implies
that the algorithm does not have to be trained. It is

jos@iac.es, jalfonso@iac.es, cmt@iac.es, angelv@iac.es

^1 Instituto de Astrofisica de Canarias, E-38205 La Laguna,
Tenerife, Spain

^2 Departamento de Astrofisica, Universidad de La Laguna, E-
38071 La Laguna, Tenerife, Spain
^3 The galaxies.
^4 http://www.sdss.org/dr7



autonomous and self-contained, with minimal subjective
influence. Thus, we deliberately avoid the use of physical
constraints, or other a priori knowledge. We classify
all galaxies simultaneously, requesting that galaxies with
similar rest-frame spectra belong to the same class. This
approach is in the vein of the rules for a good classification
discussed by Sandage (2005), where he points out
that physics must not drive a classification. Otherwise
the arguments become circular when the classification
is used to drive physics. The k-means algorithm that
we implement is commonly employed in data mining,
machine learning, and artificial intelligence (e.g., Everitt
1995; Bishop 2006), but it has seldom been applied in
astronomy (see, however, Sanchez Almeida & Lites 2000).
From the point of view of the algorithm, the galaxy spectra
are vectors in a high-dimensional space, where they
are distributed among a number of cluster centers. Each
vector is assigned to the cluster whose center is nearest,
and the center is the average of all the points in the
cluster. It works iteratively. Starting from guess cluster
centers, the spectra are assigned to their nearest centers,
and then the centers are re-computed until convergence
is reached. (Further details are given in § 2.) We choose
it because of its extreme computational simplicity, as
required to deal with large data sets (§ 2), and because
it turned out to work very well in the first case we
attempted. To our surprise, the algorithm managed to
separate spectra of galaxies in the green valley within a
collection of dwarf galaxies encompassing the full range
of spectral types (Sanchez Almeida et al. 2009, § 3.1).
Therefore, we found it natural to test the ability of k-means
to distinguish among all kinds of galaxy spectra,
and the success of this follow-up exercise is precisely the
work reported here. In addition to the above virtues, the
k-means method provides a prototypical high signal-to-noise
spectrum for each class of galaxy, with the spectra
of the galaxies in a class being similar to the associated
prototypical spectrum. These few representative spectra
can be studied and characterized in detail as if they
were individual galaxies, and then their properties can
be attributed to all class members (§ 10). Other popular
classification methods lack this powerful and convenient
feature (see § 1.1).

The acronym ASK stands for Automatic Spectroscopic
K-means-based, and it is used throughout the text to
denote our classification. The paper is structured as follows.
§ 1.1 provides an overview of the main spectral
classification methods employed so far. It also summarizes
systematic trends resulting from the application of
those methods. Our k-means classification algorithm is
examined in § 2, where we test the class recovery upon
known classes (§ 2.2), we analyze the repeatability of
the classification (§ 2.1), and we assign probabilities to
class membership (§ 2.3). The SDSS/DR7 dataset is
briefly introduced in § 3. The actual classification of
SDSS/DR7 is described in § 4. The ASK classification
is compared with Principal Component Analysis (PCA)
classification in § 5 (see also § 1.1). The self-consistency
of the ASK classification is discussed in various sections
dealing with specific results: relationship between ASK
class and Hubble type (§ 6), ASK class and color sequence
(§ 7), ASK class and AGN activity (§ 8), and
ASK class and redshift (§ 9). Further applications of
the classification procedure are sketched in § 10. The
ASK classification is publicly available as we explain in
§ 11. This section also outlines ongoing works based on
ASK.

1.1. Spectral classification of galaxies 

The first spectral classifications of galaxies are almost
coeval with the discovery of the Hubble sequence.
Hubble (1936) discusses how the spectral types and
colors systematically vary within the morphological sequence,
with ellipticals being the reddest and open spirals the
bluest (see also Humason 1931). One of the early attempts
to set up a spectroscopic classification of galaxies
is that by Morgan & Mayall (1957). They assign the
blue part of the visible spectrum (3850 Å - 4100 Å) to
stellar classes from A to K. They find a clear relationship
between spectral class and shape, with the most
concentrated galaxies (E, S0) belonging to class K, and
the most diffuse galaxies (Sc, Irr) included in class A.
The relationship applies to some 80% of the galaxies, a
percentage probably larger for the targets of highest luminosity.
Aaronson (1978) shows how the visible and IR
colors of galaxies along the Hubble sequence can be understood
as a one parameter family, in terms of the superposition
of spectra of A0V dwarf stars and M0III giant
stars. Bershady (1995) points out that a simple model
consisting of two stellar spectral types can reproduce the
observed broad band colors, but only if the spectral types
are allowed to vary. Five primary spectral types result
from this modeling. Similar conclusions are also reached
by Zaritsky et al. (1995) using stellar spectrum fitting.

Principal Component Analysis (PCA) is probably the
most popular classification method employed so far.
Each spectrum is decomposed as a linear superposition
of a small number of eigenspectra, so that a few coefficients
in this expansion (eigenvalues) fully describe the
spectrum. It is fairly fast and robust, and a solid mathematical
theory supports it (Everitt 1995). To the best
of our knowledge, the first applications of PCA in this
field have to do with stellar classification (e.g., Deeming
1964; Whitney 1983), then moved to quasar spectra (e.g.,
Mittaz et al. 1990; Francis et al. 1992), and finally arrived
at the spectral classification of regular galaxies
(e.g., Sodre & Cuevas 1994; Connolly et al. 1995). PCA
is the method of reference, and we compare it in § 5
with our k-means. Two general results are common to
all PCA analyses. Spectrumwise, galaxies can be characterized
and distinguished by means of a single parameter
that links the coefficients of the two or three first eigenspectra.
Then different classes are obtained by splitting
(somewhat artificially) this 1-dimensional family into
pieces. The approach holds for 2dF galaxies (Folkes et al.
1999; Madgwick 2003), for galaxies in Kennicutt (1992)
(Connolly et al. 1995; Sodre & Cuevas 1997), for Las
Campanas Redshift Survey galaxies (Bromley et al.
1998), for DEEP2 galaxies (Madgwick et al. 2003), for
IUE galaxies (Formiggini & Brosch 2004), and for SDSS
(Yip et al. 2004). The second common result is the
correspondence between spectral sequence and Hubble
type. Even though ellipticals tend to be red and spirals
tend to be blue, this relationship has a large intrinsic
scatter (Connolly et al. 1995; Sodre & Cuevas 1997;
Ferreras et al. 2006), which augments towards the UV
(Formiggini & Brosch 2004). Sometimes elliptical galaxies
with blue colors are found in the local universe (e.g.,
Kannappan et al. 2009), and this deviation from the
trend is expected to grow even further with increasing
redshift if, as Conselice (2006) argues, it is a coincidence
that Hubble types correlate with colour in the nearby
universe. At higher redshifts morphologically classified
ellipticals are often blue in colour and actively forming
stars (Conselice 2006; Huertas-Company et al. 2009).

Despite the advantages mentioned above, PCA
presents a clear drawback. It does not provide prototypical
spectra to characterize the classes. The PCA
eigenspectra do not resemble any member of the data
to be classified and, in general, eigenspectra are of
difficult physical interpretation (e.g., Chan et al. 2003;
Formiggini & Brosch 2004; Yip et al. 2004). The advantage
of having classes characterized by prototypical
spectra is clear. These few spectra can be studied in
detail using standard diagnostic techniques developed
for individual galaxies through the years. Then the attributes
of the prototypical spectra can be passed on to
the class members, or they can be used as intermediate
grid-points to interpolate the properties of the class
members (see § 10). Moreover, the differences between a
particular galaxy and its class prototype allow for precise
relative measurements. In an attempt to complement
PCA with this feature, Chan et al. (2003) developed an
archetypal analysis algorithm. As the authors explain,
it is like PCA but the eigenspectra are required to be
members or mixtures of members of the input data set.
However, by construction, the eigenspectra are extreme
data points lying on the data set outskirts. Although
physically meaningful, the eigenspectra are outliers, and
it may be difficult to connect their physical properties
with those of typical galaxy spectra. Other improvements
on the basic PCA technique are local linear embedding
(Vanderplas & Connolly 2009), and ensemble learning
independent component analysis (Lu et al. 2006).
These extensions are computationally expensive, and so
far they have been introduced only as proof-of-concept
works.

In addition to the superposition of stellar spectra
and the PCA techniques described above, galaxies have
been classified using neural networks (Folkes et al.
1996; Madgwick 2003), massive lossless data compression
(Reichardt et al. 2001), information bottleneck
(Slonim et al. 2001), and probably others. The algorithms
have flourished in response to the availability of
new large spectral databases. We are still in an expanding
phase, which should lead to a final convergence of
the various techniques. The different methods seem to
roughly coincide in the global picture, but it is so far
unclear whether they agree in the details.

2. THE CLASSIFICATION ALGORITHM 

In the context of classification algorithms, galaxy spec- 
tra are vectors in a high dimensional space, with as many 
dimensions as the number of wavelengths in use. The 
galaxy catalog to be classified is a set of vectors in this 
space, and so the (Euclidean) distance between any pair 
of them is well defined. Vectors (i.e., spectra) are as- 
sumed to be clustered around a number of cluster cen- 
ters. The classification problem consists in (a) finding the 
number of clusters, (b) finding the cluster centers, and 
(c) assigning each galaxy in the catalog to one of these 
centers. We employ the k-means algorithm to carry out
this classification (see, e.g., Chapter 5 in Everitt 1995;
Bradley & Fayyad 1998; Sanchez Almeida et al. 2009).
In the standard formulation, it begins by selecting at 
random from the full data set a number k of template 
spectra. Each template spectrum is assumed to be the 
center of a cluster, and each spectrum of the data set is 
assigned to the closest cluster center (i.e., that of mini- 
mum distance or, equivalently, closest in a least squares 
sense). Once all spectra in the dataset have been clas- 
sified, the cluster center is re-computed as the average 
of the spectra in the cluster. This procedure is iterated 
with the new cluster centers, and it finishes when no 
spectrum is re-classified in two consecutive steps. The 
number of clusters k is arbitrarily chosen but, in prac- 
tice, the results are insensitive to such selection since only 
a few clusters possess a significant number of members, 
so that the rest can be discarded. On exit, the algorithm 
provides a number of clusters, their corresponding clus- 
ter centers, as well as the classification of all the original 
spectra now assigned to one of the clusters. 
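
The standard formulation described above can be sketched in a few lines. The following is only an illustrative sketch in plain Python, not the code used in this work: the function names (sq_dist, mean_vector, kmeans) and the max_iter safeguard are assumptions introduced here, and a production version would rely on a vectorized array library to handle catalog-size data.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    """Component-wise average of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def kmeans(spectra, k, seed=0, max_iter=100):
    """Standard k-means sketch: returns (centers, labels).

    spectra : list of equal-length lists of floats (one per galaxy)
    k       : number of initial templates drawn at random from the data
    """
    rng = random.Random(seed)
    centers = [list(s) for s in rng.sample(spectra, k)]
    labels = [-1] * len(spectra)
    for _ in range(max_iter):
        # Assignment step: each spectrum goes to its nearest center.
        new_labels = [
            min(range(len(centers)), key=lambda j: sq_dist(s, centers[j]))
            for s in spectra
        ]
        if new_labels == labels:
            break  # no spectrum was re-classified: converged
        labels = new_labels
        # Update step: each center becomes the average of its members.
        for j in range(len(centers)):
            members = [s for s, lab in zip(spectra, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = mean_vector(members)
    return centers, labels
```

Clusters that end up with few or no members can simply be discarded afterwards, which is why the initial choice of k matters little in practice.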

The algorithm is simple and fast, as required to treat
large data sets. It assures that galaxies with similar spectra
end up in the same cluster, and provides cluster centers,
i.e., prototypical spectra representative of all the
galaxies in a cluster. In addition, it seems to work very
well separating galaxy spectra, as inferred from the first
test (see § 1), and from this work. Unfortunately, it has
a major drawback. It yields different clusters with each
random initialization. After pondering pros and cons, we
decided to carry on with the algorithm, but not without
evaluating the impact of the initialization on the classification.
The impact is quantified and controlled through
three complementary methods: (1) carrying out different
random initializations and comparing their results,
(2) assigning galaxies to several classes, each one with
its own probability, and (3) trying alternative methods
of initialization. The first point is dealt with in § 2.1 and
§ 4, leading to classifications whose classes share some
70% of the galaxies. It is not 100% because of spectra
lying in between classes. This difficulty is to some extent
cured by the second point, treated in § 2.3, which allows
us to assign galaxies to several classes and, therefore,
to identify galaxies in class borders. The third point is
treated in the next paragraph, concluding that the scatter
in the classification is not significantly modified by
the mode of initialization. It does modify the timing,
though.

We tried several initialization methods, including the
standard one, that by Bradley & Fayyad (1998), and
others (Peña et al. 1999). None of them seems to reduce
the scatter due to the initial random seed (§ 2.1 and § 4).
This behavior can be understood in terms of the existence
of borderline galaxies, as argued in Appendix A.
A proper initialization, however, reduces the iterations
required to converge, and speeds up the procedure. We
have adopted our own method, which is simple and fast,
and it starts off with a reduced number of classes. It tries
to select initial cluster centers according to the clusters
that exist in the data set. If initial centers are purely
chosen at random, then the clusters having the largest
number of elements are overrepresented, and minor clusters
may be even absent. The procedure works as follows:
(Step 1) choose at random a small set of initial
cluster centers (say, 10). (Step 2) Run one iteration of
the standard k-means, and select as initial cluster center
the cluster center with the largest number of elements.
(Step 3) Remove from the set of galaxies to be classified
those belonging to the selected cluster center. (Step 4)
Go to step 1 if galaxies are still left; otherwise end. In
addition to this particular initialization, we tuned the
standard k-means described above with one extra ingredient.
The iteration loop ends when the classifications
in two successive steps are sufficiently close to one another,
i.e., when 99% of the assignations do not vary between
two iterations. This simplification speeds up the convergence
since the classification of the remaining 1% takes
a long time, and does not help finding the main galaxy
classes, already well characterized by 99% of the sample.
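
Steps 1 to 4 and the relaxed stopping rule can be sketched as follows. This is a hypothetical plain-Python rendering, not the authors' code: the names seed_centers, nearest, and converged are introduced here for illustration, and taking the mean of the most populated cluster as the selected center is one possible reading of Step 2.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(s, centers):
    """Index of the center closest to spectrum s."""
    return min(range(len(centers)), key=lambda j: sq_dist(s, centers[j]))


def seed_centers(spectra, n_trial=10, seed=0):
    """Pick initial centers by repeatedly peeling off the largest cluster."""
    rng = random.Random(seed)
    remaining = list(spectra)
    seeds = []
    while remaining:
        # Step 1: a small random set of trial centers.
        trial = rng.sample(remaining, min(n_trial, len(remaining)))
        # Step 2: one k-means iteration; keep the most populated cluster.
        labels = [nearest(s, trial) for s in remaining]
        biggest = max(set(labels), key=labels.count)
        members = [s for s, lab in zip(remaining, labels) if lab == biggest]
        seeds.append([sum(col) / len(members) for col in zip(*members)])
        # Step 3: remove the galaxies of the selected cluster.
        remaining = [s for s, lab in zip(remaining, labels) if lab != biggest]
        # Step 4: loop again while galaxies are left.
    return seeds


def converged(old_labels, new_labels, fraction=0.99):
    """True when at least `fraction` of the assignments are unchanged."""
    same = sum(a == b for a, b in zip(old_labels, new_labels))
    return same >= fraction * len(new_labels)
```

The seeds returned by seed_centers would replace the purely random templates of the standard formulation, and converged would replace the strict "no spectrum re-classified" test.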

2.1. Testing the repeatability of the classification 

Two classifications of the same data set are identical 
if they include the same galaxies in each class. With 
this criterion in mind, we compare two different classifi- 
cations by pairing their two sets of classes according to 
the number of galaxies that they have in common. We 
compute the number of galaxies in common between each 
pair of classes formed by one class from one classification 
and the second class from the second classification. The 
two classes sharing the largest number of galaxies are as- 
sumed to be equivalent. The same criterion is repeated 
until all the classes of one of the classifications have been 
paired. Since the number of classes in the two classifications
is not necessarily the same, some classes remain
unpaired. This procedure tries to maximize the number
of galaxies sharing the same class in the two classifications.
We use the percentage of galaxies in equivalent
classes as a measurement of the agreement between the
two classifications, dubbing it the coincidence rate.
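
The pairing procedure lends itself to a compact sketch. The function below is an illustrative plain-Python version (the name coincidence_rate and the label-list inputs are assumptions of this example): it counts the galaxies shared by every pair of classes, pairs classes greedily starting from the largest overlap, and returns the percentage of galaxies in equivalent classes.

```python
from collections import Counter


def coincidence_rate(labels_a, labels_b):
    """Percentage of galaxies falling in equivalent classes of two runs."""
    # Number of galaxies shared by every (class in A, class in B) pair.
    common = Counter(zip(labels_a, labels_b))
    used_a, used_b, shared = set(), set(), 0
    # Pair classes greedily, largest overlap first; each class is used once.
    for (a, b), n in common.most_common():
        if a not in used_a and b not in used_b:
            used_a.add(a)
            used_b.add(b)
            shared += n
    return 100.0 * shared / len(labels_a)
```

Two classifications that differ only in the numbering of their classes give a coincidence rate of 100%, since the pairing is insensitive to the labels themselves.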

A first series of tests to check repeatability has been
carried out using the 21493 quiescent blue compact dwarf
galaxies (QBCD) selected by Sanchez Almeida et al.
(2009) from SDSS/DR6. Here we employ the same wavelength
windows used in Sanchez Almeida et al. (2009;
they are labeled as QBCD in Table 1). Thirty independent
classifications of this data set yield a coincidence
rate of 71 ± 9 %, with the error bar being the standard
deviation. We think that these fluctuations
in the number of common galaxies are due to the random
initialization coupled with the large number of variables
defining the spectra; see Appendix A. This 70% coincidence
has two consequences. (1) A galaxy chosen at
random from the sample has a 70% chance of appearing
in equivalent classes in two different runs of the classification.
(2) The cluster centers are very well defined since,
independently of the initialization, they share 70% of the
galaxies that define them. The coincidence of the final
classification of SDSS/DR7 is similar (but a bit smaller),
as discussed in detail in § 4.

2.2. Testing the class recovery upon known classes 

We construct a set of mock observations to see whether 
the algorithm is able to recover clusters imposed on the 
data. To a data set of 21493 spectra (same number as in
§ 2.1), we add different amounts of pixel-to-pixel uncorrelated
random noise. The 21493 spectra were randomly
selected among three real spectra representative of galaxies
along the color sequence in the blue cloud, the red sequence,
and the green valley (see § 7). This 3-class mock
observation encompasses the full range of spectra to be 
expected. When no noise is added, the algorithm returns 
3 classes, with 100 % coincidence between the original 
and the classified spectra. As the noise increases, the 
number of recovered classes increases too. The fact that 
new classes have appeared does not mean that the algo- 
rithm is malfunctioning. Actually, the algorithm seems 
to be classifying the noise. When the signal-to-noise ratio 
per pixel is ~ 10, which is typical of SDSS spectra, one 
retrieves some 10 classes. However, the spectra of the dif- 
ferent original classes are never mixed up, i.e., they end 
up in separate classes. The noise artificially increases the 
number of classes, but it does not wash them out. More- 
over, it is easy to figure out which classes are faked by 
this kind of random noise because, being pixel-to-pixel 
uncorrelated, it does not modify global properties of the 
spectra such as colors. Different classes with different 
colors are not artifacts created by noise. 
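
A mock data set of this kind can be generated as in the sketch below. It is a hypothetical illustration, not the code used for the test: the function name make_mock is introduced here, the noise is assumed to be Gaussian, and its amplitude is set from the mean signal of each template, which is only one possible definition of the signal-to-noise ratio per pixel.

```python
import random


def make_mock(templates, n_spectra, snr, seed=0):
    """Noisy copies of randomly chosen templates, with their true classes.

    snr is the signal-to-noise ratio per pixel; None means no noise.
    """
    rng = random.Random(seed)
    spectra, truth = [], []
    for _ in range(n_spectra):
        k = rng.randrange(len(templates))
        template = templates[k]
        if snr is None:
            noisy = list(template)
        else:
            # Pixel-to-pixel uncorrelated Gaussian noise, scaled to the
            # mean signal of the template (an assumption of this sketch).
            sigma = sum(template) / len(template) / snr
            noisy = [f + rng.gauss(0.0, sigma) for f in template]
        spectra.append(noisy)
        truth.append(k)
    return spectra, truth
```

Classifying the returned spectra and comparing the resulting classes with the list of true classes reproduces the kind of recovery test described in this section.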

2.3. Assigning several classes to each galaxy 

Given a collection of galaxy spectra, the k-means algo- 
rithm infers a small set of classes or clusters, and assigns 
each spectrum to one of them. A number of reasons advise
addressing the inverse problem too, i.e., assigning
classes to individual spectra once the classes are known.
This alternative is required to classify spectra not used
in the classification, which turns out to be a practical
case of major interest (e.g., § 4, § 6). In addition, the
classification does not provide unique sharp classes. The
results in § 2.1 suggest that a significant number of galaxies
are in between classes, and this fact could be easily
acknowledged and quantified with a procedure to esti- 
mate the probability that a given galaxy belongs to each 
one of the known classes. Borderline galaxies must fit in 
several classes with similar probabilities. A general pro- 
cedure to carry out such multiple assignation is worked 
out in this section. 

The distance of spectrum $\mathbf{s} = (s_1, s_2, \ldots)$ to class center
$\mathbf{c} = (c_1, c_2, \ldots)$ is defined as

$$d(\mathbf{s}-\mathbf{c}) = |\mathbf{s}-\mathbf{c}| = \Big[\sum_i w_i\,(s_i-c_i)^2\Big]^{1/2}, \tag{1}$$

where $s_i$ and $c_i$ are the values of the spectrum and the
cluster center in the $i$-th wavelength pixel. The weights
$w_i$ allow one to select a subset among the pixels defining
the spectra, i.e.,

$$w_i = \begin{cases} 0, & \text{in discarded wavelengths,}\\ m^{-1}, & \text{in used wavelengths,} \end{cases} \tag{2}$$

with $m$ the total number of pixels where $w_i \neq 0$. k-means
selects as class center the average of all the spectra belonging
to the class. Each one of these spectra has its
own distance to the cluster center, so that the full set
defines a distribution of distances to the cluster center
for the spectra in the class. Let us call $f_{\mathbf{c}}(d)$ the probability
density function (PDF) of such distribution, i.e.,
$f_{\mathbf{c}}(d)\,\Delta d$ is the probability of finding a galaxy in cluster
$\mathbf{c}$ with a distance to the cluster center between $d$ and
$d + \Delta d$. The chances that galaxy $\mathbf{s}$ belongs to cluster $\mathbf{c}$
can be estimated as the probability of finding galaxies in
the cluster with distances equal or larger than $d(\mathbf{s}-\mathbf{c})$,
i.e.,

$$P(\mathbf{s},\mathbf{c}) = \int_{d(\mathbf{s}-\mathbf{c})}^{\infty} f_{\mathbf{c}}(x)\,dx. \tag{3}$$

Unfortunately, sorting the classes according to their
$P(\mathbf{s},\mathbf{c})$ may be inconsistent with the assignation of
classes made by k-means. Given a galaxy spectrum,
k-means assigns to it the class of minimum distance,
i.e., the class $\mathbf{c}_k$ where $d(\mathbf{s}-\mathbf{c}_k) \le d(\mathbf{s}-\mathbf{c})\ \forall\,\mathbf{c}$. However,
there is no guarantee that the class of minimum
distance coincides with the class of maximum probability,
i.e., in general $\mathbf{c}_k \neq \mathbf{c}_p$, with $\mathbf{c}_p$ defined so that
$P(\mathbf{s},\mathbf{c}_p) \ge P(\mathbf{s},\mathbf{c})\ \forall\,\mathbf{c}$. Consequently, the sorting of
classes according to their probabilities cannot be used
to order classes in a way that agrees with k-means. We
circumvent the problem defining a merit function, based
on probabilities, that can be used to judge the membership
of a given galaxy to the various classes. As a main
constraint, the class of minimum distance must have the
largest merit, so that the ordering provided by this merit
function agrees with the assignation made by the k-means
algorithm.

We call such merit function quality. For a spectrum $\mathbf{s}$,
assigned by k-means to class $\mathbf{c}_k$, the quality of class $\mathbf{c}$ is
defined as

$$Q(\mathbf{c},\mathbf{s},\mathbf{c}_k) = \int_{d(\mathbf{s}-\mathbf{c})}^{\infty} f_{\mathbf{c}_k}(x)\,dx. \tag{4}$$

Given a spectrum, its membership to various classes is
judged according to $Q$, with the class of largest $Q$ the
main affiliation, the one of second largest $Q$, the second
main, and so on. The sorting according to quality is
actually a sorting according to distance to the cluster
centers since

$$Q(\mathbf{c}_1,\mathbf{s},\mathbf{c}_k) > Q(\mathbf{c}_2,\mathbf{s},\mathbf{c}_k) \iff d(\mathbf{s}-\mathbf{c}_1) < d(\mathbf{s}-\mathbf{c}_2). \tag{5}$$

This property guarantees that the class of minimum distance
has the largest quality and, therefore, the main
class attending to its quality agrees with the k-means
assignation. Equation (5) follows from Eq. (4) because
$f_{\mathbf{c}_k}$ is always positive, and so $Q$ is a monotonic decreasing
function of $d(\mathbf{s}-\mathbf{c})$. In addition to conforming to
k-means, the quality $Q$ has a number of practical properties.
Due to the normalization of the PDFs,

$$\int_0^{\infty} f_{\mathbf{c}}(x)\,dx = 1, \tag{6}$$

the quality is always comprised between zero, for no
match, and one, for perfect match,

$$Q(\mathbf{c},\mathbf{s},\mathbf{c}_k) \longrightarrow \begin{cases} 1, & \text{when } d(\mathbf{c}-\mathbf{s}) \to 0,\\ 0, & \text{when } d(\mathbf{c}-\mathbf{s}) \to \infty. \end{cases} \tag{7}$$

The best quality, i.e., that of the class of minimum distance,
admits a simple interpretation. It is just the probability
that the galaxy belongs to this class because

$$Q(\mathbf{c}_k,\mathbf{s},\mathbf{c}_k) = P(\mathbf{s},\mathbf{c}_k). \tag{8}$$

If the best quality is large, then there are high chances
that the galaxy is part of the class. If the best quality is
similar to the second best quality, then the galaxy is in
between classes. If the best quality is very small ($\ll 1$),
then the galaxy is an outlier, meaning that it does not
fit into any of the main classes.
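
The definitions of distance and quality translate into a short sketch. The code below is an illustration in plain Python under stated assumptions: the names distance, survival, and qualities are introduced here, the boolean list used encodes which pixels have non-zero weight, and the integral of the PDF is replaced by the empirical fraction of class members lying at or beyond a given distance rather than by an analytical fit.

```python
from bisect import bisect_left


def distance(s, c, used):
    """Weighted distance of Eq. (1); used[i] is True for pixels with w_i = 1/m."""
    m = sum(used)
    total = sum((si - ci) ** 2 for si, ci, u in zip(s, c, used) if u)
    return (total / m) ** 0.5


def survival(sorted_d, d):
    """Empirical fraction of class members with distance >= d."""
    return (len(sorted_d) - bisect_left(sorted_d, d)) / len(sorted_d)


def qualities(s, centers, member_distances, used):
    """Return (k, Q): k is the minimum-distance class, Q[c] follows Eq. (4).

    member_distances[c] : distances to center c of the galaxies in class c
    """
    d = [distance(s, c, used) for c in centers]
    k = min(range(len(centers)), key=d.__getitem__)
    # All qualities use the distance distribution of the assigned class k,
    # so sorting by Q is the same as sorting by distance.
    ref = sorted(member_distances[k])
    return k, [survival(ref, dc) for dc in d]
```

With the qualities sorted in decreasing order, a galaxy whose two largest values are similar can be flagged as borderline, and one whose largest value is very small can be flagged as an outlier.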

Computing qualities as explained above requires estimating
the PDF for the distribution of distances within
a class. This can be derived from the histogram of distances
to the cluster center among the galaxies that k-means
has included within the class. After several trials,
we found out that the observed cumulative distribution
of distances can be very well approximated as

$$\int_0^{d} f_{\mathbf{c}}(x)\,dx \simeq a_1 + a_2\,d + a_3 \exp\left[-[(d-a_4)/a_5]^2\right], \tag{9}$$

where the five free coefficients $(a_1, \ldots, a_5)$ are determined
by a non-linear least-squares fit of the empirical cumulative
histogram of observed distances. Figure 1 shows one
such fit.

Real qualities are represented in Fig. 2, which includes
the results in one of the test classifications in § 2.1.
Figure 2, top, shows scatter plots of the 2nd-best quality versus
the best quality (top left), and the 3rd-best quality
versus the best quality (top right). A significant number
of galaxies appear not far from the border where
the best and 2nd-best qualities are equal and, therefore,
many galaxies lie in between classes. Figure 2, bottom,
shows the histograms of those qualities. Note how the
best quality has a rather flat distribution peaking at,
say, 0.6. Note also how the number of galaxies with very
small best quality is also very small. These are outliers
whose spectra differ significantly from the characteristic
spectra of the main classes.








Fig. 1. — Example of cumulative distribution function of dis- 
tances to the class center. The symbols show the observed values 
whereas the solid line corresponds to the analytical representation 
used in the work. Distances are relative to the standard deviation. 




[Figure 2. Top panels: scatter plots with "Best quality" on the horizontal axis. Bottom panel: histograms versus "Quality" for the Best, 2nd-Best, and 3rd-Best classes.]

Fig. 2. — Top: example of scatter plot of quality versus quality.
Top-left: 2nd-best quality versus best quality. Points not far from 
the diagonal can be classified as either one of the two classes. Top- 
right: 3rd-best quality versus best quality. Bottom: histograms of 
qualities for the best quality class (the solid line), the 2nd-best class 
(the dotted line), and the 3rd best class (the dashed line). Note 
that even the best qualities are sometimes close to zero, implying 
that these galaxies are outliers of the classification. 






3. THE DATA SET: SDSS/DR7 SPECTRA 

The SDSS/DR7 is the final major data release of the
SDSS project. Details about SDSS and the DR7 can be
found in, e.g., Stoughton et al. (2002), Abazajian et al.
(2009), and also in the thorough SDSS website^5. The
spectroscopic part of the survey contains some 930000
galaxy spectra, and this full set is classified in our work.
The basic properties of spectrograph and spectra will be
summarized here, but we refer to the references given
above for further details. The SDSS spectrograph has
two independent arms, with a dichroic separating the
blue beam and the red beam at 6150 Å. It simultaneously
renders a spectral range from 3800 Å to 9250 Å, with a
spectral resolution between 1800 and 2200. The sampling
is linear in logarithmic wavelength, with a mean dispersion
of 1.1 Å pix$^{-1}$ in the blue and 1.8 Å pix$^{-1}$ in the red.
Repeated 15 min exposure spectra are integrated to yield
a S/N per pixel > 4 when the apparent magnitude in the
g-band is 20.2. The spectrograph is fed by fibers which
subtend about 3" on the sky. Most galaxies are larger
than this size; therefore, the fibers tend to sample their
central regions (e.g., 88% of the galaxies have effective
radii larger than half the fiber diameter).

Two adjustments are made on the original spectra before
classification. First, they are brought to restframe
wavelengths using the redshifts provided by SDSS. This
wavelength shift involves an interpolation, and we take
advantage of it to bring all spectra to a common wavelength
scale, as required by the classification algorithm.
The common scale has the same number of pixels
as the original spectra (3850), and it is equispaced in
logarithmic wavelength from 3800 Å to 9250 Å. Obviously,
the IR part of the spectrum goes missing as the redshift
increases, and we extrapolate it with a constant. Note,
however, that this missing part is not used for classification
(see below and § 4). The second manipulation is a
global scaling applied after the restframe correction. The
spectra are normalized to the flux in the g color filter
(effective wavelength ~ 4825 Å), a normalization factor that
we compute for each spectrum using the transmission curve
provided by SDSS. This re-scaling automatically corrects
for the flux dimming associated with the redshift (e.g.,
Blanton & Roweis 2007), but the original motivation was
allowing comparison between galaxies of different absolute
magnitudes. If the global scaling is not removed,
the flux of the galaxy completely dominates the classification,
and galaxies are split in bins of equal luminosity,
rather than in spectral classes.
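The two adjustments can be illustrated with a minimal sketch. It is not the code used for the classification: the function names are hypothetical, linear interpolation is an assumption (the text only states that an interpolation is involved), and the top-hat band used in the toy example is a stand-in for the real SDSS g transmission curve, which is not reproduced here.

```python
import numpy as np

# Common restframe grid quoted in the text: 3850 pixels, equispaced in
# logarithmic wavelength from 3800 to 9250 Angstrom.
N_PIX = 3850
REST_GRID = np.logspace(np.log10(3800.0), np.log10(9250.0), N_PIX)


def _trapezoid(y, x):
    """Plain trapezoidal integration of y(x)."""
    return np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x))


def to_restframe(obs_wave, flux, z, grid=REST_GRID):
    """Shift an observed spectrum to restframe and resample it on `grid`.

    Assumption: linear interpolation. np.interp holds the end values
    constant outside the sampled range, which reproduces the constant
    extrapolation of the missing red end.
    """
    rest_wave = obs_wave / (1.0 + z)
    return np.interp(grid, rest_wave, flux)


def normalize_to_band(wave, flux, band_wave, band_transmission):
    """Divide a spectrum by its transmission-weighted mean flux in a band."""
    t = np.interp(wave, band_wave, band_transmission, left=0.0, right=0.0)
    mean_flux = _trapezoid(t * flux, wave) / _trapezoid(t, wave)
    return flux / mean_flux


# Toy example: a flat spectrum at z = 0.1 and a hypothetical top-hat band.
obs_wave = np.logspace(np.log10(3800.0), np.log10(9250.0), N_PIX)
flux = np.full(N_PIX, 2.0)
rest = to_restframe(obs_wave, flux, z=0.1)
toy_band_wave = np.array([4000.0, 4001.0, 5499.0, 5500.0])
toy_band_trans = np.array([0.0, 1.0, 1.0, 0.0])
normalized = normalize_to_band(REST_GRID, rest, toy_band_wave, toy_band_trans)
```

After the second step the toy spectrum has unit mean flux in the band, whatever its original level, which is the property that prevents luminosity from driving the classification.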

No further correction has been applied to the data. We
do not correct for extinction, seeing, galaxy size, aperture
bias, etc. This apparent sloppiness actually results from
a deliberate attitude towards classification, following the
guidelines by Sandage (2005) mentioned in § 1. If these
corrections are important and the classification is working
properly, then the spectra of the same type of galaxy
with and without an uncorrected bias should appear in
separate bins. It is then a matter of a posteriori physical
interpretation to infer what causes the different classes,
and eventually join some of them when appropriate.



^1 http://www.sdss.org/dr7






Fig. 3. — Scatter plots of apparent g magnitude (left) and ab- 
solute g magnitude (right) versus redshift for the galaxies in the 
SDSS/DR7. The plots include only a fraction of the galaxies,
chosen at random to avoid overcrowding.



4. FINAL IMPLEMENTATION: THE 
CLASSIFICATION OF SDSS/DR7 

The spectra to be classified by k-means must share
the same wavelength scale, i.e., the same sampling interval
and the same wavelength range. SDSS/DR7 has
a significant number of galaxies up to redshift 0.5 (see
Fig. 3). At this redshift the reddest restframe wavelength
that SDSS provides is 6200 Å; therefore, in order to use
the full data set for classification, one should restrict the
range of wavelengths to below 6200 Å. Alternatively, one
can restrict the range of redshifts of the galaxies. We
have chosen the second possibility to avoid overlooking
in the classification lines as important as Hα. The full
set of spectra described in § 3 has been divided into a low
redshift part (redshift < 0.25, with 788677 spectra) and
a high redshift part (redshift > 0.25, with 138649 spectra).
The low redshift part is classified by means of the
k-means algorithm, which provides the classes. Then the
high redshift part is classified according to the classes
derived from the low redshift part using the tools developed
in § 2.3. The reason for choosing 0.25 as the
dividing redshift is twofold. First, the distribution of
redshifts in SDSS/DR7 seems to present a discontinuous
behavior at roughly this redshift (see Fig. 3 and also the
discussion in § 9). Second, and most important, 0.25 is
the largest redshift that allows us to include in the classification
the near-IR TiO bands characteristic of M stars -
the reddest restframe wavelength at this redshift is some
7500 Å.
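The wavelength limits quoted in this paragraph follow from dividing the red end of the observed range by (1 + z). A short check, assuming the 9250 Å red limit given in § 3 (the function name is only illustrative):

```python
RED_LIMIT_OBS = 9250.0  # Angstrom; red end of the observed range quoted in § 3


def reddest_restframe(z, red_limit_obs=RED_LIMIT_OBS):
    """Reddest restframe wavelength still covered at redshift z."""
    return red_limit_obs / (1.0 + z)


limit_z050 = reddest_restframe(0.50)  # about 6170 Angstrom
limit_z025 = reddest_restframe(0.25)  # 7400 Angstrom
```

Both values agree with the rounded figures in the text (6200 Å at redshift 0.5, and some 7500 Å at redshift 0.25).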

In addition to removing the reddest part of the spectrum
lost because of the redshift of the galaxies, several reasons
advise using only selected bandpasses to carry out
the classification. Including too much continuum does
not add information but dilutes the signals contained in
the spectral lines. The number of wavelengths in the
spectra sets the dimensions of the vectors to be classified.
The larger the number of wavelengths the more computationally
demanding the classification, which makes it
advisable to limit the number of wavelengths. Keeping
these caveats in mind, we use for classification only the
bandpasses shown as dotted lines in the bottom of Fig. 4,






TABLE 1
Bandpasses used in the ASK classification of SDSS/DR7

From - To      Comment
4000 - 4420    QBCD blue, HδA, HδF, CN1, CN2, Ca4227, G4300, HγA, HγF, Fe4383
4452 - 4474    Ca4455
4514 - 4559    Fe4531
4634 - 4720    Fe4668
4800 - 5134    QBCD green, Hβ, Fe5015, Mg1
5154 - 5196    Mg2, Mgb
5245 - 5285    Fe5270
5312 - 5352    Fe5335
5387 - 5415    Fe5406
5696 - 5720    Fe5709
5776 - 5796    Fe5782
5876 - 5909    Na D
5936 - 5994    TiO1
6189 - 6272    TiO2
6500 - 6800    QBCD red
7000 - 7300    TiO band
7500 - 7700    TiO band

Note. Wavelengths are in Å. The Comment column
contains the names of the Lick indexes in the bandpass,
plus additional information used to identify the
bands in the main text.



which are also listed in Table 1. Except for a near-IR
window between 8400 Å and 8800 Å, they include all the
bandpasses employed by Sanchez Almeida et al. (2009,
§ 3) in the classification that triggered the present work
(see § 1). These bandpasses contain the main emission
lines that trace activity (star formation and AGN activity).
Since they are distributed along the visible spectrum,
they also provide sensitivity to the colors of the
galaxies. In addition, we include all the bandpasses of
the Lick indexes, which were selected because they depend
on the age and metallicity of the stellar content of
the galaxies^2 (Worthey et al. 1994; Worthey & Ottaviani
1997). Finally, we include two windows at the location
of TiO bands characteristic of M stars and early type
galaxies (at 7150 Å and 7600 Å). These bandpasses are
sensitive to the level of the near-IR continuum, and are
tracers of old stellar populations.
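As an illustration of how the bandpasses of Table 1 reduce the spectra to the vectors that are classified, the sketch below builds a boolean mask on the common restframe grid of § 3. Treating the band limits as inclusive is an assumption, and the names are hypothetical. Under these assumptions the number of selected pixels should come out close to the 1637 wavelengths quoted in the text.

```python
import numpy as np

# (from, to) limits in Angstrom, copied from Table 1.
BANDPASSES = [
    (4000, 4420), (4452, 4474), (4514, 4559), (4634, 4720), (4800, 5134),
    (5154, 5196), (5245, 5285), (5312, 5352), (5387, 5415), (5696, 5720),
    (5776, 5796), (5876, 5909), (5936, 5994), (6189, 6272), (6500, 6800),
    (7000, 7300), (7500, 7700),
]

# Common restframe grid from § 3: 3850 pixels, log-spaced, 3800-9250 Angstrom.
grid = np.logspace(np.log10(3800.0), np.log10(9250.0), 3850)


def bandpass_mask(wave, bands=BANDPASSES):
    """True at the wavelengths that fall inside any of the bandpasses."""
    mask = np.zeros(wave.shape, dtype=bool)
    for lo, hi in bands:
        mask |= (wave >= lo) & (wave <= hi)  # assumption: inclusive limits
    return mask


mask = bandpass_mask(grid)
n_selected = int(mask.sum())  # dimension of the classification space
```

A spectrum resampled on `grid` is then reduced to its classification vector with `spectrum[mask]`.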

^2 Using this argument to select bandpasses somehow conflicts
with the philosophy of having a classification not driven by physics.
However, the conflict is only marginal. The Lick indexes cover a
large part of the spectrum, and we take all of them blindly. Using
the Lick bandpasses is only a particular way of enhancing the
contribution of spectral lines with respect to the continuum.

Initially, the computer resources needed to carry out
the classification were unclear. The procedure is iterative
(§ 2), and the timing is mostly set by the number
of iterations, which scales in an unknown fashion with the
number of spectra and wavelengths (788677 x 1637). An
exploratory procedure was written in Interactive Data
Language (IDL), and it turned out to be faster than expected
since convergence typically occurs in less than
50 iterations. Using an 8-core Intel Xeon 2.66 GHz machine
with 32 GB of RAM, 50 iterations last less than
300 minutes. (The access to sufficient RAM was critical,
since the array with the spectra to be classified occupies
some 11.6 GB.) Even if fast, the IDL code does not allow
us to carry out the battery of classifications required to
study the dependence of the classification on the random
initialization (§ 2). Fortunately, the k-means algorithm
can be parallelized, and we developed a second, parallel
version of the code using Fortran and the MPI (Message
Passing Interface) library. The performance of the parallel
version is good. The algorithm scales very well, so
that adding more CPUs implies a nearly linear reduction
in the execution time. A hundred executions of the parallel
code using a cluster of 48 Intel Xeon CPUs (2.4 GHz)
take of the order of 1 hour. This figure outperforms the
IDL code by a factor of 500.
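The iterative structure referred to above can be sketched with the standard textbook (Lloyd) form of k-means. This is neither the IDL nor the Fortran/MPI code of this work, and the details of the variant described in § 2 may differ; the function name, the convergence test, and the toy data are assumptions made for illustration.

```python
import numpy as np


def kmeans(data, k, max_iter=100, seed=0):
    """Textbook Lloyd iteration; returns (labels, centers, n_iterations)."""
    rng = np.random.default_rng(seed)
    # Random initialization: k distinct data vectors act as initial centers.
    centers = data[rng.choice(len(data), size=k, replace=False)].copy()
    labels = np.full(len(data), -1)
    for iteration in range(1, max_iter + 1):
        # Squared Euclidean distance from every vector to every center.
        # (This builds an n x k x d array: fine for a toy, not for DR7.)
        d2 = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # no vector changed class: converged
        labels = new_labels
        for j in range(k):
            members = data[labels == j]
            if len(members) > 0:  # keep the old center if a class empties
                centers[j] = members.mean(axis=0)
    return labels, centers, iteration


# Toy data: two well separated groups of 4-dimensional vectors.
toy_rng = np.random.default_rng(1)
toy = np.vstack([toy_rng.normal(0.0, 0.1, (50, 4)),
                 toy_rng.normal(5.0, 0.1, (50, 4))])
labels, centers, n_iter = kmeans(toy, k=2)
```

The assignment step treats each vector independently, which is the property that makes the algorithm easy to distribute among processors.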

Aided with the parallel version of k-means^3, we carry
out 150 independent classifications of the data set. Because
of the random initialization, each one of these classifications
differs (§ 2). Each run of the algorithm groups
similar spectra in clusters so, in principle, all of them provide
valid classifications. Then the problem arises as to
which one of these classifications is best, i.e., which one
should be chosen as the classification. Ideally, one would
like to choose a classification (1) with a small number of
classes, (2) being representative of all classifications, and
(3) having small dispersion within the classes. Condition
(1) is obvious and will not be discussed further. According
to condition (2), we would like the classification to be
as representative as possible of any other classification.
Condition (3) demands that the spectra employed in deriving
the classification are as close as possible to a class
center spectrum. Figure 5 shows scatter plots of three numerical
coefficients that we devised to quantify the three
requirements above. Figures 5a and 5c include the average
percentage of galaxies that a particular classification
has in common with the other 149 classifications - it is
just the percentage of galaxies in equivalent classes as
defined in § 2.1, and it is labeled in the figures as coincidence.
It spans between 62% and 71%. The coincidence
is represented in Fig. 5a versus the average dispersion of
the classification, which is just the mean of the distance
between galaxies and class centers defined in equation (1).
The dispersion admits a simple interpretation: it is the
typical difference per pixel between a spectrum and the
template of its class. (Because of the normalization, the
spectra have their continuum at about one, therefore, a
dispersion of 0.1 corresponds to differences of the order of 10%.) Note
that there is no obvious correlation between the two parameters,
but the classifications seem to cluster around
two dispersions, the smallest being of the order of 0.16.
Figure 5b shows the scatter plot between the number of
classes in a classification (classes altogether containing
99% of the galaxies) and the dispersion. Classifications
having between 11 and 22 classes exist, with a typical
value between 15 and 19. Again no obvious relationship
between number of classes and dispersion is observed. Finally,
Fig. 5c shows the scatter plot of coincidence versus
number of classes. Attending to the three requirements
above, we select those classifications having

1. less than 18 classes, 

2. coincidence larger than 70%, 

3. dispersion smaller than 0.17. 

Fig. 4. Template spectra representing the major classes in the ASK classification of the SDSS/DR7 galaxies. The different spectra
have been artificially shifted upward according to their u - g color. (Otherwise the plot becomes overcrowded.) The numbers next to the
spectra (on the left hand side of the plot) correspond to the class number, which was assigned according to the u - g color (ASK 0 for
the reddest, ASK 1 for the second reddest, and so on up to ASK 22). Gaps in the numbering indicate the presence of minor classes of
intermediate colors. The fluxes are in dimensionless units, i.e., they are normalized to the average flux in the g-filter bandpass. Wavelengths
are given in μm. The dotted line differs from zero at the wavelengths used for classification.

Fig. 5. Scatter plots with the three parameters characterizing
the 150 different classifications from which we have drawn the final
one in Fig. 4. (a) Percentage of galaxies common to all other
classifications (coincidence) versus typical scatter of the spectra
with respect to the class spectrum (dispersion). (b) Number of
major classes versus dispersion. (c) Coincidence versus number of
major classes. We select classifications having coincidence > 70%,
dispersion < 0.17, and 17 classes or less.

Four classifications fulfill these requirements. Lacking a
better criterion, we choose one of them at random. The
chosen classification turns out to have a coincidence of
70.8%, a dispersion of 0.16, and it has 28 classes, but 17
of them contain 99% of the galaxies. These 17 classes
are denoted in the paper as major classes. The spectra
of the major classes are shown in Fig. 4. They have been
labeled according to the u - g color, from the reddest,
ASK 0, to the bluest, ASK 27. By using numbers to
label the classes we are not implicitly assuming that the
spectra represent a one dimensional family. The numbers
are only tags to name the classes.
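The selection among the independent classifications can be sketched as follows. The thresholds are those listed above; the function names and the toy numbers are hypothetical, and the coincidence and dispersion are taken as already computed (their definitions are in § 2.1 and equation 1).

```python
import numpy as np


def n_major_classes(class_sizes, fraction=0.99):
    """Smallest number of classes that together hold `fraction` of the galaxies."""
    sizes = np.sort(np.asarray(class_sizes))[::-1]  # largest class first
    cumulative = np.cumsum(sizes) / sizes.sum()
    return int(np.searchsorted(cumulative, fraction) + 1)


def select_runs(n_major, coincidence, dispersion,
                max_classes=18, min_coincidence=70.0, max_dispersion=0.17):
    """Indexes of the runs with < 18 classes, coincidence > 70%, dispersion < 0.17."""
    keep = ((np.asarray(n_major) < max_classes)
            & (np.asarray(coincidence) > min_coincidence)
            & (np.asarray(dispersion) < max_dispersion))
    return np.flatnonzero(keep)


# Toy class sizes (not the real ones): three classes hold 99% of the objects.
toy_n_major = n_major_classes([500, 300, 195, 3, 2])

# Toy summary of four runs (not the real 150 classifications).
selected = select_runs(n_major=[17, 19, 16, 15],
                       coincidence=[70.8, 70.5, 65.0, 70.2],
                       dispersion=[0.16, 0.16, 0.16, 0.20])

# Lacking a better criterion, one of the surviving runs is taken at random.
chosen = int(np.random.default_rng(0).choice(selected))
```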

Fig. 6. (a) Color versus ASK class number. Class numbers
have been assigned according to the u - g color of the template
spectra, which explains the monotonic decrease of this color with
class number. The larger the class number the bluer the galaxy.
Colors g - r and r - i are also included as indicated by the inset.
(b) Histogram of the number of galaxies existing in each class. The
horizontal dotted line shows the threshold that separates major
classes from the rest (i.e., classes having altogether 99% of the
classified galaxies). The colors and number of members shown in
these figures are listed in Table 2.

Figure 6a shows the colors characteristic of the ASK
classes. The number of elements corresponding to each
class is included in Fig. 6b. The horizontal dotted line
in this figure indicates the threshold for major class, i.e.,
classes with a number of elements above this threshold
contain 99% of the classified galaxies. Their spectra are
those shown in Fig. 4. The main properties of all classes
are summarized in Table 2. A blow up with the bluest
part of Fig. 4 is included in Fig. 7, where some of the
characteristic emission and absorption features are labeled.
Note how the spectra vary gradually with the class
number. Even the smallest ripples in these average spectra
are real. Upon averaging, the signal-to-noise ratio is
expected to increase as the square root of the number of
class members. The major class with the fewest members still
has about 5000 elements (Fig. 6b), which sets a lower limit
to the S/N per pixel of about 700. The systematic change
of global properties along the sequence is important to
constrain the effects of noise on the number of classes
(§ 2.1). Although noise artificially increases the number
of classes, it does not change in a systematic way global
spectral properties such as colors. Except perhaps at
the blue end of the classification, the colors of classes
vary systematically along the sequence (Fig. 6a; see also
§ 7), which rules out any significant influence of the random
pixel-to-pixel uncorrelated noise on the number of
classes.
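The signal-to-noise argument can be checked with a few lines of arithmetic. The per-spectrum S/N of about 10 obtained at the end is inferred here from the two quoted numbers (about 5000 members and S/N of about 700); it is not a value stated in the text.

```python
import math


def template_snr(single_snr, n_members):
    """S/N of the average of n_members spectra with uncorrelated noise."""
    return single_snr * math.sqrt(n_members)


gain = math.sqrt(5000)            # S/N improvement for the smallest major class
implied_single_snr = 700 / gain   # per-spectrum S/N implied by the quoted limit
```

The gain is about 71, so the quoted lower limit corresponds to individual spectra with S/N per pixel of roughly 10.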

As we explained in the first paragraph of the section,
the full set of galaxies was split into two parts.
The low redshift part has been used to derive spectral
classes, which automatically leads to its classification as
explained above. Based on these classes, and using the
procedure developed in § 2.3, we extended the classification
to the high redshift subset. The use of the same
classes is an assumption which, however, seems to be secure
since the properties of the galaxies thus classified do
not show any systematic difference with respect to the
low redshift subset (see § ...). Moreover, the procedure
in § 2.3 has also been applied to the low redshift part,
already classified by k-means. It provides qualities for all
classes, which permits identifying borderline galaxies and
outliers, and it allows us to derive physical parameters by
interpolation (§ 10).

The success of k-means does not imply the existence of
well defined clusters in the 1637-dimensional classification
space. As we discuss above, the separation between
classes is not sharp. Galaxies are often close to the borders,
which explains the variability between different realizations
of the classification (§ 2.1). The presence of
many borderline galaxies seems to imply a rather continuous
distribution of points in the classification space,
which k-means partitions so that the elements of
each class are similar. Generally speaking, classes
should not be associated with true clusters in the classification
space. However, some of the classes seem to represent
genuine clusters as judged from the distribution of
qualities. Qualities were introduced in § 2.3 to characterize
the membership of each galaxy to the classes. Galaxies
next to class borders have similar best quality and
2nd-best quality. Therefore, if the galaxies in a class are
of this kind, then the class cannot portray a clear cluster.
Conversely, classes corresponding to well defined clusters
have their members separated from the other classes, i.e.,
their galaxies tend to have a best quality larger than the
2nd-best quality. This condition is met by some of the
classes, indicating clustering. Thus we use the ratio of
the 2nd-best quality to the best quality to assign a degree of
grouping to the different classes. Figure 8 shows histograms
of this ratio for the major classes. Some of the histograms
have a clear peak at a ratio significantly smaller
than one, implying clustering (e.g., ASK 2). Other histograms
present a flat distribution (e.g., ASK 0), whereas
a minority of classes show most of their members having
similar best and 2nd-best qualities (e.g., ASK 13). Attending
to the shape of these histograms, the clustering
of each class has been labeled as good, neutral or bad;
see the last column of Table 2. Note that the clustering
tends to be good or neutral rather than bad.
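The diagnostic behind these labels can be sketched as follows, assuming that the best and 2nd-best qualities of the galaxies of one class are available as arrays. The function name, the number of bins, and the toy values are hypothetical; the cut at best quality > 0.2 is the one quoted in the caption of Fig. 8. The final good, neutral, or bad label is assigned by inspection of the histogram and is not automated here.

```python
import numpy as np


def quality_ratio_histogram(best_q, second_q, min_best=0.2, n_bins=20):
    """Histogram of (2nd-best quality / best quality) for one class."""
    best_q = np.asarray(best_q, dtype=float)
    second_q = np.asarray(second_q, dtype=float)
    keep = best_q > min_best              # discard poorly classified galaxies
    ratio = second_q[keep] / best_q[keep]
    counts, edges = np.histogram(ratio, bins=n_bins, range=(0.0, 1.0))
    return ratio, counts, edges


# Toy qualities for four galaxies; the third one fails the best-quality cut.
ratio, counts, edges = quality_ratio_histogram(
    best_q=[0.9, 0.8, 0.1, 0.5],
    second_q=[0.3, 0.7, 0.05, 0.5],
)
```

A histogram piling up near one indicates a class made of borderline galaxies, whereas a peak well below one indicates a genuine cluster.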






TABLE 2
Main properties of the ASK classes

ASK class^1  members  u-g^2  g-r^2  r-i^2  Hβ^3   [OIII]λ5007^3  Hα^3   [NII]λ6583^3  % emission^4  Clustering
 0*          111447   1.59    0.91   0.44   -0.3      0.7          0.9      2.0            9         neutral
 1*           18032   1.57    1.05   0.53    0.3      1.1          2.9      3.2           29         neutral
 2*             ...   1.57    0.84   0.39   -0.4      0.4          0.1      1.3            2         good
 3*           80530   1.37    0.75   0.35   -0.7      0.6          0.7      1.6            8         neutral
 4*           20456   1.20    0.91   0.45    2.7      2.3         13.9      8.2           95         bad
 5*             ...   1.18    0.80   0.38    1.5      1.6          8.5      5.3          ...         neutral
 6*             ...    ...     ...    ...    ...      ...          ...      ...          ...         good
 7             1089   1.11    0.71   0.29    8.7     84.7         22.5     15.2          100         good
 8              211   1.06    0.59   0.18   22.8    233.1         36.0     15.5           95         good
 9*           71671   1.00    0.68   0.32    2.4      1.6         12.0      5.8           89         neutral
10*           32227   0.95    0.72   0.34    5.1      2.7         24.2     11.1           99         neutral
11*            6369   0.91    0.78   0.33    9.7      5.4         42.2     19.8          100         good
12*           51314   0.83    0.58   0.26    5.4      3.0         24.5      9.3           98         good
13*           26705   0.81    0.50   0.24    2.6      3.0         11.2      3.9           81         bad
14*           25026   0.72    0.52   0.21   10.2      6.1         45.7     16.1           99         good
15               68   0.70   -0.23  -0.57  176.4    743.5        715.3     14.1           25         good
16*           26504   0.67    0.40   0.17    7.1      8.8         30.4      7.1           99         neutral
17              134   0.65   -0.15  -0.51  161.7    630.1        549.9     16.4           35         good
18*            5687   0.61    0.48   0.13   19.5     18.5         83.9     24.9          100         good
19*           12808   0.61    0.36   0.13   13.0     21.6         54.6      9.8          100         good
20              238   0.56   -0.03  -0.31  105.5    492.9        408.2     17.6           81         good
21              179   0.55   -0.03  -0.29   97.1    461.5        356.7     16.6           69         good
22*            4781   0.55    0.28   0.08   19.8     48.3         82.5      9.7          100         good
23             2366   0.54    0.34   0.04   31.5     61.1        130.7     22.8           89         neutral
24             1910   0.51    0.22   0.00   34.5    106.6        148.5     12.9           98         good
25              488   0.51    0.08  -0.16   72.3    253.8        302.7     18.5           94         good
26              986   0.50    0.19  -0.07   51.7    159.9        219.3     19.6          100         good
27              199   0.48    0.13  -0.16   67.4    230.7        278.8     19.7           85         good

^1 The asterisks denote major classes, i.e., those that altogether include 99% of the galaxies.

^2 The colors have been computed from the template spectra using the appropriate SDSS bandpasses.

^3 Equivalent width in the template spectra given in Å. Negative implies line in absorption.

^4 Percentage of galaxies in the class with Hβ in emission according to the SDSS/DR7 catalog.



5. RELATIONSHIP BETWEEN ASK CLASS AND 
PCA CLASSIFICATION 

SDSS/DR7 already provides a spectral classification
based on PCA, which is a linear expansion of each spectrum
in terms of a small number of eigenspectra (§ 1).
The eigenspectra for the SDSS expansion were derived
from a subset of approximately 200000 galaxies, as explained
by Yip et al. (2004). Following Connolly et al.
(1995) and others, Yip et al. (2004) use a diagnostic plot
to separate spectral classes based on the three first eigenvalues,
$a_1$, $a_2$ and $a_3$. Extreme emission galaxies, early
type galaxies, and late type galaxies can be distinguished
in the $\phi_{KL}$ versus $\theta_{KL}$ plane, where the two mixing angles
are defined as

$\phi_{KL} = \arctan(a_2/a_1)$,

$\theta_{KL} = \arccos(a_3)$.    (10)

Figure 9 shows this diagnostic plot for all SDSS/DR7
galaxies, and for the major ASK classes separately. We
have used the PCA eigenvalues directly provided by
SDSS/DR7. The ASK classes occupy well defined places
in the PCA diagnostic plot, which implies that the PCA
classification and the ASK classification are consistent.
Given the ASK class of a galaxy, one can predict its location
in the PCA plane. The opposite does not hold since
some ASK classes overlap in the PCA diagnostic plot (cf.
ASK 9 and ASK 10). The ASK classification is more
refined; it simply includes more classes than PCA. Therefore,
the two classifications are consistent but not equivalent.
The location of early type galaxies, late type galaxies,
and extreme emission galaxies made by Yip et al.
(2004) is also included in Fig. 9 (top left panel, symbols
et, lt, and ee, respectively). One can see how this rough
PCA based separation is also consistent with the ASK
classes. There is a systematic trend to go from the location
of the early types to the late types as the ASK
class number increases. This behavior coincides with the
trend to be derived from the morphological classification
in § 6. The region of extreme emission galaxies deserves
a separate comment. Note that the galaxies appearing
in this region do not show up among the galaxies in the
major classes included in Fig. 9. These extreme galaxies
belong to the minor classes with high ASK class number
(not shown), i.e., the bluest among the ASK classes. The
points clearly outside the contour in Fig. 9 are partly included
in the ASK classes next to them, and partly in
additional minor classes (not shown). ASK 2 seems to
be the only exception. It includes a few galaxies in the
extreme emission region, and we have not been able to
pin down the cause. However, the fact that ASK 2 shows
more outliers than other classes is probably an artifact
due to ASK 2 being the most common class (Table 2). If
all classes include a similar fraction of outliers, they will
be more conspicuous in scatter plots of ASK 2.
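For reference, the two mixing angles of equation (10) can be evaluated as in the short sketch below. The function name is hypothetical, and the coefficients are assumed to be normalized so that the absolute value of the third one does not exceed one, as required by the arccosine.

```python
import numpy as np


def mixing_angles(a1, a2, a3):
    """Return (phi_KL, theta_KL) in degrees from the first three PCA coefficients."""
    a1, a2, a3 = (np.asarray(x, dtype=float) for x in (a1, a2, a3))
    phi = np.degrees(np.arctan(a2 / a1))   # equation (10), first line
    theta = np.degrees(np.arccos(a3))      # equation (10), second line
    return phi, theta


# Toy coefficients with a1^2 + a2^2 + a3^2 = 1.
phi_kl, theta_kl = mixing_angles(0.5, 0.5, np.sqrt(0.5))
```

With these toy coefficients both angles equal 45 degrees.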

In short, although ASK is more refined, ASK and PCA 
seem to agree with small internal scattering. Moreover, 
the scatter between these two purely spectroscopic clas- 
sifications is much smaller than the scatter in the ASK 
versus morphological classification analyzed in the next 
section. 






Fig. 8. Histograms of the ratio between the 2nd-best and the best qualities for the major classes. Those classes corresponding to
proper clusters in the 1637-dimensional classification space should have a distribution of ratios peaking away from one (e.g., ASK 2). Only
galaxies whose best quality is larger than 0.2 have been considered.



6. RELATIONSHIP BETWEEN ASK CLASS AND 
HUBBLE TYPE 

The morphological type of a galaxy (Hubble type) is
closely related to its spectrum, a relationship that has
long been known (see § 1). The analysis of such a relationship in the
case of ASK is mandatory, and we will do it in a follow-up
work where the morphology of a large number of galaxies
is derived automatically (§ 11). However, in order
to show the consistency of the ASK classification, we
include here a preamble based on a limited number of
galaxies which shows how early types are associated with
small ASK numbers, and vice-versa.

We have ASK-classified the galaxies in the spectral atlas
of Kennicutt (1992). He provides spatially integrated
spectra (from 3650 Å to 7100 Å) of a set of 55 nearby
galaxies with known Hubble types. The set contains all
Hubble types, from giant ellipticals (cD, NGC 1275) to
dwarf irregulars (dI, Mrk 35). We assign each galaxy to
the ASK class whose spectrum is closest to the galaxy
spectrum as explained in § 2.3. The match between ASK
template spectra and Kennicutt spectra is illustrated in
Fig. 10, which contains representative spectra of an early
type galaxy and a late type galaxy. These particular fits
ignore the spectral regions with emission lines (see the
weights shown as a dotted line in the figures). The scatter
plot of the assignation is shown in Fig. 11. It displays
the Hubble type given by Kennicutt (1992) versus the
ASK class for the galaxies in the atlas. (Actually, for
53 out of the 55 original galaxies, since Mrk 3 is not in
the electronic catalog, and NGC 3303 has no clear Hubble
type - it presents two nuclei undergoing a major merger.)
Note the clear trend for the small ASK numbers to be
associated with early types and vice-versa. The dividing
line between early types (E, ..., S0) and late types (Sa,
SBa, ..., I) seems to be about ASK 6, so that numbers
smaller than this limit correspond to early types. The
trend is even clearer if one ignores those galaxies classified
as peculiar by Kennicutt (1992), which are shown
in the figure as asterisks. However, one cannot ignore
the large scatter in the figure - there is no one-to-one
relationship between spectroscopic class and morphological
class. The conclusion that there is a general trend
with large scatter is very much in the vein of all previous
studies comparing spectroscopic and morphological classifications
(e.g., Zaritsky et al. 1995; Conselice 2006; and
§ 1). Actually, the relationship between morphological
and spectroscopic types gets fuzzier with increasing lookback
time, and perhaps it disappears in the early Universe
(Conselice 2006; Huertas-Company et al. 2009).
We have repeated the above exercise usi ng the eye-bold 
morphological classification presented by iFukugita et al.l 



Classification of SDSS/DR7 galaxy spectra 



13 





140 



80 100 120 140 80 100 120 140 




140 



80 100 120 140 80 100 120 140 




Fig. 9. — PCA diagnostic plot for 50000 randomly chosen galaxies in the SDSS/DR7. The full set is included in the top left panel , where 
we als o mark the location of the early type (et), late type (It) and extreme emission (ee) galaxies according to the separation in Yip e t al.l 
(|2004l ) . The other plots show the different major ASK classes individually (see the insets). Note how the ASK classes occupy well defined 
places in the PCA diagnostic plot. Given the ASK class of a galaxy, one can predict its location in the PCA plane. The opposite does not 
hold in general. All plots include the same contour indicating the boundaries of the full distribution. The two mixing angles 9kl a-^d ipKh 
are given in degrees. Major classes 19 and 22 are not shown because they do not fit into the figure, but they follow the sequence. 



(|2007f) . It is based on bright galaxies in a north equato- 
rial stripe from SDSS/DR3, visually classified by three 
different observers based on ij-band images. The cata- 
log contains 2253 galaxies, but only 1866 targets have 
spectra and so overlap with our classification. We have 
also used this set to compare the morphological Hubble 
type and the ASK c lassification, with results similar to 
those for IKennicuttI galaxies. There is a global trend 
with significant scatter. The size of the set allowed us 
to discard several observational bias that may cause the 
scatter. It is not due to misclassifications. The scat- 
ter is not reduced upon using only high quality ASK 
class determinations (quality > 0.8; 12 3p . or when Im 
and peculiar galaxies are excluded from the sample. The 
scatter remains considering only small galaxies contained 
within the spectroscopic fiber (< 1.5" effective radius). 
The last test assures that the scatter is not produced 
by large spirals misclassified because the SDSS spectrum 
just samples their (red) bulge. 



7. ASK CLASSES AND THE BIMODAL COLOR 
DISTRIBUTION 



The colors of the galaxies follow a bimodal distribution
(e.g., Strateva et al. 2001; Balogh et al. 2004;
Baldry et al. 2004), with a red population (the red sequence),
a blue population (the blue cloud), and the so-called
green valley in between (e.g., Salim et al. 2007).
The two main populations are believed to represent passively
evolving red galaxies and blue star-forming galaxies,
with galaxies in transition forming the green valley.
As we explain in § 1, this work was partly triggered by
the ability of k-means to distinguish green valley spectra
(Sanchez Almeida et al. 2009). Therefore, we found
it necessary to discuss the location of the ASK classes
in a plot where the red and the blue populations show
up separately (e.g., Bershady et al. 2000; Strateva et al.
2001).

Figure 12 (top left panel) shows the distribution of all
SDSS/DR7 galaxies in a u − g versus g − r plot. The image
represents the 2-dimensional histogram of the distribution of
colors. The concentrated spot at g − r ~ 0.8 and u − g ~ 1.7
corresponds to the red sequence. The blue cloud appears in this
representation as an extended tail. The other panels in Fig. 12
show the different classes separately, and they all include the
2-dimensional histogram for reference. An inspection of Fig. 12
reveals a number of properties. First, the ASK classification
separates galaxies into well defined positions of the
color-color plane. The red sequence is characterized by the
most numerous class, ASK 2. ASK 0 and ASK 1 also belong to the
red sequence, but to its outskirts. The blue cloud is split
into several classes, starting with ASK 9 and continuing with
higher ASK classes. In between these two groups, ASK 3, ASK 5
and ASK 6 populate the green valley; ASK 4 seems to be made of
outliers of the main relationship. As we originally presumed,
the ASK classification separates galaxies in colors with enough
finesse to automatically pinpoint classes in the green valley.
An in-depth analysis of the galaxies in these classes will be
carried out in a follow-up work (see § 11).

[Figure 10: two panels of flux versus Wavelength [Å], from 4000 to 7000 Å. The surviving inset reads: NGC 3379 E0, ASK class 2, Weight.]

Fig. 10. Two representative examples of the fits between ASK
class spectra and galaxies in the atlas by Kennicutt (1992). Galaxy
names, Hubble types, and ASK numbers are included in the insets.
The dotted lines correspond to the weights used for fitting;
wavelengths where the weight is zero have been ignored. Wavelengths
are given in Å. Symbols and solid lines correspond to Kennicutt
spectra and ASK class spectra, respectively.

8. RELATIONSHIP BETWEEN ASK CLASS AND 
AGN ACTIVITY 

We have studied the position of our classes on the BPT diagram
(named after Baldwin et al. 1981), which is commonly used to
separate Active Galactic Nucleus (AGN) activity from normal
star formation activity in galaxies with emission lines (e.g.,
Kauffmann et al. 2003). The diagnostic diagram consists of a
scatter plot of the flux ratio [OIII] λ5007/Hβ versus
[NII] λ6583/Hα. The two lines in each pair are so close in
wavelength that the BPT diagram is almost insensitive to
extinction and to other systematic photometric miscalibrations.
Figure 13, top left panel, contains the BPT diagram for the
full set of galaxies with emission lines. The fluxes of the
lines have been taken directly from SDSS/DR7. The figure
includes a curved solid line dividing star-forming galaxies
(below the line) and AGNs (above the line). This separation was
worked out by Kauffmann et al. (2003), who also distinguish
between different types of AGNs. The straight line separates
the regions occupied by Seyfert galaxies and LINERs, as
indicated by the insets. The figure also shows BPT plots for
the galaxies belonging to the individual ASK classes (see the
class in the label on top of each plot). The panels for the
classes include box symbols at the positions where the class
template spectra show up. They are barely visible because they
always appear in the center of the cloud of points
corresponding to the individual galaxies. (ASK 0, 2, and 3 do
not have such boxes since they present Hβ in absorption; see
Table 2.)

Only a small fraction of red galaxies have emission lines that
can be used to place them in the BPT diagram (2% for ASK 2; see
Table 2). However, when they do, their emission corresponds to
AGN activity (Fig. 13, ASK 0-2). This result agrees with the
current view that host galaxies of AGNs are preferentially
early type galaxies (Kauffmann et al. 2003, and references
therein). Conversely, blue galaxies correspond to star-forming
galaxies, with little sign (if any) of undergoing AGN activity
(from ASK 9 on; see Fig. 13). The galaxies that seem to be in
the green valley (ASK 3, ASK 5 and ASK 6; § 7) also appear in
the BPT diagrams in the region of AGNs. This is again
consistent with the current wisdom that AGN activity quenches
star formation, and so it may be responsible for the transit of
galaxies across the green valley (Schawinski et al. 2007,
2009). The case of ASK 6 deserves special attention. According
to its position in the BPT diagram, there is little doubt that
it is formed by Seyfert galaxies. This is consistent with the
shape of Hα in the class template spectrum, with very broad
wings that extend up to 2000 km s⁻¹. Moreover, the only galaxy
in the catalog by Kennicutt (1992) classified as ASK 6 is a
well known cD elliptical with a Seyfert nucleus (NGC 1275; see
Fig. 11).
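
The division between star-forming galaxies and AGNs used in this section can be illustrated with a short sketch. It assumes the demarcation curve in the form in which it is usually quoted from Kauffmann et al. (2003), log([OIII]/Hβ) = 0.61/(log([NII]/Hα) − 0.05) + 1.3; the function names are hypothetical and are not part of any SDSS tool. The straight line separating Seyferts from LINERs is not specified in the text and is therefore left out.

```python
import math


def kauffmann03_limit(log_n2ha):
    """Largest log10([OIII]5007/Hbeta) of a star-forming galaxy at a given
    log10([NII]6583/Halpha), following the curve usually quoted from
    Kauffmann et al. (2003). Only meaningful for log_n2ha < 0.05."""
    return 0.61 / (log_n2ha - 0.05) + 1.3


def bpt_class(oiii, hbeta, nii, halpha):
    """Classify a galaxy as 'star-forming' or 'AGN' from four emission-line
    fluxes. Any common flux unit works because only ratios are used."""
    if min(oiii, hbeta, nii, halpha) <= 0:
        raise ValueError("all four lines must be detected in emission")
    log_n2ha = math.log10(nii / halpha)
    log_o3hb = math.log10(oiii / hbeta)
    # To the right of the vertical asymptote of the curve, every galaxy
    # lies on the AGN side.
    if log_n2ha >= 0.05:
        return "AGN"
    if log_o3hb > kauffmann03_limit(log_n2ha):
        return "AGN"
    return "star-forming"


if __name__ == "__main__":
    # Illustrative flux values only, not SDSS measurements.
    print(bpt_class(oiii=1.0, hbeta=1.0, nii=0.3, halpha=1.0))
    print(bpt_class(oiii=5.0, hbeta=1.0, nii=1.0, halpha=1.0))
```

A galaxy without all four lines in emission cannot be placed on the diagram, which is why the sketch raises an error in that case instead of returning a class.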

9. RELATIONSHIP BETWEEN ASK CLASS AND 
REDSHIFT 

Cone diagrams (or pie plots) are polar plots where radius is
redshift and azimuth is right ascension (e.g., Folkes et al.
1999). Figure 14 shows cone diagrams for four representative
classes, i.e., a red galaxy class (ASK 2), an AGN class
(ASK 6), and two blue galaxy classes (ASK 9 and ASK 16). The
range of declinations is limited to between 35° and 45°. From
Fig. 14 and similar plots considering other classes, other
ranges of redshift, and other declinations, we draw the
following conclusions. ASK 0, 1, 2, and 3 are observed at
higher redshifts than the rest of the classes. This effect is
partly due to the luminous red galaxy (LRG) extension of the
main SDSS spectroscopic sample (Eisenstein et al. 2001). The
LRG search has been designed to detect passively evolving red
galaxies, and it includes galaxies fainter than the main
flux-limited portion of the SDSS galaxy spectroscopic sample
(down to r ~ 19.5, rather than the regular cutoff at
r ~ 17.8). However, a part of the separation in redshift
between blue galaxies and red galaxies is believed to be real.
Dwarf galaxies cannot be observed at high



[Figure 11: scatter plot with ASK Class Number on the horizontal axis and Hubble type on the vertical axis, in bins running between cD,E0 and I,Im,Sm,SBm,dI.]

Fig. 11. Scatter plot of Hubble type versus ASK class for the galaxies in the atlas by Kennicutt (1992). The plot contains 53 out 
of the 55 galaxies in the atlas; Mrk 3 has no spectrum in the electronic catalog, and NGC 3303 belongs to an undefined class since it is 
undergoing a major merger. Note the clear trend for the small ASK class numbers to be associated with early types, and vice versa. The 
trend is even clearer if one ignores those galaxies classified as peculiar by Kennicutt (1992) (the asterisks). However, one cannot ignore 
the scatter: there is no one-to-one relationship between spectroscopic class and morphological class. In order to avoid the overlapping of 
galaxies with the same Hubble type and ASK class, we have added a small artificial random vertical shift to all points. 



redshift, and dwarf field galaxies tend to be star-forming
(e.g., Heavens et al. 2004), and so are included among the
bluest ASK classes. It is therefore understandable that blue
ASK classes are biased toward lower redshifts. Peculiar motions
induced by the large gravitational potential of galaxy clusters
lead to the so-called fingers of god in the cone diagrams,
i.e., elongated clumps with the major axis pointing toward the
observer (e.g., Jackson 1972). We find them preferentially in
ASK 2 cone diagrams, which we interpret as a tendency for the
red galaxies to be in clusters. ASK 6 is formed by Seyfert
galaxies, and it seems to be more spread out than the other
classes (see Fig. 14). Finally, we find no distinct or sharp
change of properties at redshift 0.25, i.e., at the divide used
to split the classification (see § 4). Galaxies at redshifts
larger than this value were not used to derive the classes. The
featureless transition at this special redshift indicates no
obvious systematic difference between the galaxies that define
the classes and the rest.

10. RETRIEVING PHYSICAL PROPERTIES OF 
INDIVIDUAL GALAXIES BY INTERPOLATION 

One can foresee several applications of the classification. In
particular, it can be used to derive non-trivial physical
parameters of individual galaxies by interpolation of the
properties of the classes. Suppose we want to measure the
parameter X for the galaxy s. Assume that X varies
systematically along the ASK sequence, being X_i in the i-th
class. Then, one can approximate



X for galaxy s as

X(s) \simeq \sum_{i=0}^{27} Q_i(s)\, X_i \Big/ \sum_{i=0}^{27} Q_i(s),    (11)

where Q_i(s) represents the qualities assigned to the galaxy
as explained in § 2.3, except that we have shortened the
notation so that

Q_i(s) = Q(c_i, s, c_k).    (12)

In practice, the series (11) can be truncated to consider only
a few terms where the quality is large enough. Regardless of
how complicated it is to estimate a given parameter, once the
classification and the value of the physical parameter in the
ASK classes are available, equation (11) trivially provides the
parameter for all SDSS/DR7. We have used the Star Formation
Rate (SFR) to illustrate the procedure (Fig. 15). The
equivalent width of Hα is a proxy for the Specific SFR (or
SSFR; e.g., Kennicutt 1998), and it varies systematically along
the ASK sequence (Table 2). Using the empirical relationship
between Hα flux and SFR as calibrated by Kennicutt (1998),
Fig. 15 shows a scatter plot with the SFRs obtained directly
and by interpolation based on equation (11). The figure
considers only star-forming galaxies, i.e., ASK > 7 according
to § 8. We truncate the series using only the three classes of
highest qualities. Figure 15 shows that interpolated SFRs are
correct within a factor of two for SFRs varying over four
orders of magnitude. Unfortunately, the interpolation does not
work in all cases. We failed to estimate


[Figure 12: one panel per class (labels ASK 0, ASK 1, ASK 2, and following classes), each showing u − g versus g − r, with g − r running from 0.0 to 1.0.]

Fig. 12. Plots of u − g versus g − r for the galaxies belonging to the major ASK classes. The top left panel contains an image with the 
2-dimensional distribution of colors in the full SDSS/DR7. The remaining panels show the individual classes separately, as indicated in 
the labels, together with the 2-dimensional histogram for reference. Major classes 19 and 22 are not shown because they do not fit into the 
figure, but they follow the trend. 



metallicities by interpolation. The oxygen metallicity has 
a large dispersion within each ASK class and, therefore, 
it does not meet the condition of varying systematically 
along the ASK sequence. 
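
The interpolation of equation (11), truncated to the classes of highest quality, can be written in a few lines. The following is a minimal sketch under stated assumptions: the qualities Q_i(s) and the class values X_i are taken as already known, the function name and the numbers in the example are hypothetical, and the default of three terms mirrors the truncation used for the SFR test above.

```python
def interpolate_property(qualities, class_values, n_terms=3):
    """Quality-weighted mean of a class property, as in equation (11).

    qualities    -- Q_i(s) for one galaxy s, one entry per ASK class
    class_values -- X_i, the value of the property in each ASK class
    n_terms      -- number of highest-quality classes kept in the sums
    """
    if len(qualities) != len(class_values):
        raise ValueError("need one quality and one class value per class")
    # Keep only the n_terms classes with the largest quality.
    ranked = sorted(range(len(qualities)),
                    key=lambda i: qualities[i], reverse=True)[:n_terms]
    numerator = sum(qualities[i] * class_values[i] for i in ranked)
    denominator = sum(qualities[i] for i in ranked)
    if denominator == 0:
        raise ValueError("all retained qualities are zero")
    return numerator / denominator


if __name__ == "__main__":
    # Made-up numbers for a toy sequence of five classes.
    q = [0.0, 0.6, 0.3, 0.1, 0.05]
    x = [1.0, 2.0, 4.0, 8.0, 16.0]
    print(interpolate_property(q, x))
```

The approach only makes sense for properties that vary systematically along the sequence; for a quantity with a large dispersion inside each class, such as the oxygen metallicity, the weighted mean carries little information about the individual galaxy.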

The interpolation may be especially useful when dealing with
high redshift objects for which only noisy spectra are
available. Once each spectrum is assigned to a particular
class, one can assign all the properties of the class to the
spectrum. In this sense, one can use the spectra of the
different classes as templates for redshift determinations
(e.g., Le Fevre et al. 2005; Lilly et al. 2007). They represent
a unique set able to reproduce all kinds of local galaxies.
Unfortunately, the available redshift range is limited since
the bluest wavelength of the templates is as red as 3800 Å, and
galaxies with redshift > 2.5 will not overlap in any wavelength
with the templates. However, one could overcome this problem by
classifying galaxies at various redshift ranges in various
steps, very much in the vein of the classification for galaxies
with redshift > 0.25 explained in § 4. After classification,
the blue part of the average high redshift spectra can be used
to provide blue parts for the templates. New templates with
extended blue wavelengths would then be available, which can be
used to extend the range even further by repeating the previous
step. One can also use the classification to carry out relative
measurements. For example, even if we do not know the
metallicity of a galaxy, one can infer whether it is metal rich
or metal poor with respect to its class mates using simple
tools like the ratio between the fluxes of [NII] λ6583 and Hα
(e.g., Pettini & Pagel 2004). All the systematic errors
involved in such a comparison would be greatly reduced within a
class (e.g., Stasinska 2004; Sanchez Almeida et al. 2009).
Moreover, if the absolute metallicity of the class template is
known, these simple recipes for relative measurements yield
absolute metallicities using the class spectrum for reference.
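
The relative measurement described above can be sketched with the [NII] λ6583/Hα index. The example assumes the linear N2 calibration of Pettini & Pagel (2004), 12 + log(O/H) = 8.90 + 0.57 N2 with N2 = log([NII] λ6583/Hα), so that a difference in N2 between a galaxy and its class template translates into a difference in oxygen abundance. Function names and numerical inputs are hypothetical.

```python
import math

# Linear N2 calibration of Pettini & Pagel (2004), assumed here:
# 12 + log(O/H) = 8.90 + 0.57 * N2
PP04_ZERO_POINT = 8.90
PP04_SLOPE = 0.57


def n2_index(nii_6583, halpha):
    """N2 = log10([NII]6583 / Halpha) from two emission-line fluxes."""
    return math.log10(nii_6583 / halpha)


def relative_oxygen_abundance(n2_galaxy, n2_template):
    """Difference in log(O/H), in dex, between a galaxy and its class
    template. The zero point of the calibration cancels out."""
    return PP04_SLOPE * (n2_galaxy - n2_template)


def absolute_oxygen_abundance(n2_galaxy, n2_template, template_abundance):
    """12 + log(O/H) of the galaxy, anchored to a known template value."""
    return template_abundance + relative_oxygen_abundance(n2_galaxy,
                                                          n2_template)


if __name__ == "__main__":
    # Illustrative fluxes only, not SDSS measurements.
    galaxy = n2_index(nii_6583=1.0, halpha=10.0)
    template = n2_index(nii_6583=3.0, halpha=10.0)
    print(relative_oxygen_abundance(galaxy, template))
```

Because only the slope enters the relative measurement, a systematic offset shared by the galaxy and its class template does not affect the result, which is the point made in the text.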

11. DISCUSSION AND CONCLUSIONS 

We present an automatic unsupervised classification of all the
galaxies with spectra in the final SDSS data release (DR7). It
uses the k-means algorithm, which separates the 930000 galaxies
into 17 major classes containing 99% of the galaxies, plus
another 11 minor classes with the rest. The algorithm
guarantees that the galaxies in a class have similar spectra,
independently of their luminosities. The algorithm does not
guarantee that the classes represent true clusters in the
classification space, nor that all existing clusters are
identified. Each ASK class^1 is

^1 Acronym for Automatic Spectroscopic K-means-based class.







Fig. 13. BPT diagrams for the full set of galaxies with emission lines (top left panel), as well as for the galaxies belonging to the major 
ASK classes (remaining panels). The full set is shown as an image of the 2-dimensional histogram. This image is repeated in the rest of the 
panels for reference. We represent scatter plots for 10000 individual galaxies randomly drawn from the full SDSS/DR7 pool. The curved 
solid line separates star-forming galaxies (below the line) and AGNs (above the line). In addition, the straight solid line in the region of 
AGNs separates Seyfert galaxies and LINERs. The plots also include a box symbol for the class template spectrum when it contains the 
required emission lines. These boxes are often buried within the cloud of dots. Major classes 19 and 22 are not shown because they do not 
fit into the figure, but they follow the sequence. 



characterized by an extra-low noise spectrum resulting 
from averaging all the spectra in the class. These tem- 
plate spectra vary smoothly and systematically among 
the classes, labeled according to their g — r color from 
ASK 0, the reddest, to ASK 27, the bluest (Fig. Hand 
Table 2). The classes are well separated in the color
sequence, with a class that collects most of the red sequence
galaxies (ASK 2), a set of classes lying along the blue cloud
(ASK 9 and larger), and a class that seems to be characteristic
of the green valley (ASK 5); see § 7. Usually the classes of
red galaxies do not present emission lines; however, when they
do, their excitation is characteristic of AGN activity. In
contrast, all the galaxies in the classes on the blue cloud
seem to present emission lines, but they are typical of star
formation regions. The classes in between (i.e., in the green
valley) show AGN activity (see § 8). The ASK classification has
been compared with the morphological Hubble type. Although the
number of galaxies involved in this comparison is rather
limited, it clearly shows how the red classes tend to have
early morphological types, whereas the blue classes are
morphologically late types (Fig. 11 and § 6). The relationship
has a large intrinsic scatter, as previous studies
also find (see §[TTT|). We have confronted the ASK classes 
with the PCA-based spectroscopic classification also existing
for SDSS/DR7 (§ 5). The two of them are consistent in the sense
that ASK classes have well defined PCA eigenvalues. However,
the ASK classes are finer. We note that the scatter between
these two purely spectroscopic classifications is much smaller
than the scatter in the relationship with Hubble type (§ 6).
The distribution of classes with redshift is studied in § 9,
and it reveals that the bluer classes contain galaxies of lower
redshift, indicating that they are made of galaxies less
luminous (and so smaller) than the red classes. The same
preliminary analysis also suggests a trend for the red classes
to be more clustered than the blue classes.

All the above properties prove the consistency of the 
ASK classification. We have not found obvious contra- 
dictions between the physical properties of the classes, 
and the present understanding of galaxy properties. 
However, one should not forget the limitations of the 
analysis. The classification is not unique. We know that 
the borders between classes are not well defined, and 







Fig. 14. Cone diagrams of four representative classes: a red
galaxy class (ASK 2), an AGN class (ASK 6), and two classes of
blue galaxies (ASK 9 and ASK 16). Cone diagrams are polar plots
where radius is redshift and azimuth is right ascension. In this case
35° < declination < 45°, and redshift < 0.5 (see the labels on the
rings). We represent only a small fraction of all galaxies to
avoid cluttering.




[Figure 15: horizontal axis SFR [M_sun yr^-1], logarithmic, with ticks at 0.01, 0.10, 1.00 and 10.00.]

Fig. 15. Scatter plot of Star Formation Rates (SFRs) inferred
by interpolation on the ASK classes (ordinates) versus SFRs derived
from the parameters of the individual galaxies (abscissas). The diagonal
solid line indicates where abscissas and ordinates coincide. Only
a subset of randomly selected star-forming galaxies is represented
(ASK > 7).

that the actual number of classes is somewhat arbitrary
(§ 1 and § 2.1). The galaxy spectra seem to have a continuous
distribution of properties, and we still do not know the
reasons why k-means puts the borders between classes where it
does. Moreover, the classification relies on a number of
reasonable but otherwise subjective hypotheses (e.g., the
spectral bandpasses entering into the classification, or the
normalization to the g-band; see § 3 and Table 1). Alternative
hypotheses would render classifications differing from ASK in a
way that is difficult to foretell. All these caveats
notwithstanding, the classes inferred by ASK have different
spectra that reflect a systematic difference in the gas and
stars present in the galaxies.



Understanding the physical reasons causing the systematic
differences between spectra would help in understanding the
classification itself. The study of the physical causes
responsible for the observed diversity represents a major task
that clearly goes beyond the scope of our introductory paper.
However, a number of follow-up works dealing with the physical
interpretation of the classes are underway. As a result, some
of the classes may need to be joined or split. For example,
spectra of similar objects with different degrees of extinction
may have ended up in different classes (remember that we do not
correct for extinction, to attain a purely empirical
classification; § 3). Similarly, some of the classes may
contain unidentified clusters. Sub-divisions can be achieved by
applying k-means to selected spectral windows (e.g., the region
around Hα may help us separate emission line galaxies according
to their metallicity; Pettini & Pagel 2004). Deriving the star
formation history of the classes is fundamental to
understanding whether the differences between spectra are
tracing different star formation histories, AGN activities,
merging histories, or something else. (Is the ASK
classification revealing some sort of evolutionary sequence for
galaxies?) Fortunately, inversion codes able to constrain the
star formation history are available (e.g., Cid Fernandes et
al. 2005; Tojeiro et al. 2009), and we plan to use them. We are
also carrying out a comparison between morphological types and
spectroscopic types in a way that completes the introductory
exercise in § 6. We try to understand what causes the scatter
in the relationship between morphology and spectroscopy. Is it
the different characteristic time of evolution of morphological
changes (on short timescales) and spectroscopic changes (on
long timescales)? Is it the environment? The morphological
classification will be based on the automatic procedure by
Huertas-Company et al. (2008) using support vector machines,
which will allow us to compare morphology and spectral type for
a sizeable fraction of the SDSS/DR7 spectroscopic catalog. Work
to derive the luminosity function for the classes is pending,
i.e., to characterize the number density of galaxies of each
luminosity and class. It is needed to quantify the tendency for
high ASK classes to contain dwarf galaxies, as suggested in
§ 9.

In addition to understanding the physical mechanisms
responsible for the diversity among spectra, we foresee other
applications of the ASK classification. It provides a crude but
fast way of estimating some physical properties of a galaxy
once its ASK ascription is known (§ 10). The classification is
also useful for target selection. For example, ASK 6 is formed
by Seyfert galaxies. This class provides an ideal homogeneous
sample of some 5000 Seyferts with similar spectra for in-depth
AGN studies (e.g., extending to low mass the relationship
between supermassive black-hole mass and bulge mass; see
Ferrarese 2006 and references therein). The classification
supplies classes of galaxies in the green valley (ASK 5). These
targets allow us to address the question of what characterizes
a green valley galaxy, and to do so in a statistically
significant way. Is the green valley a short period during the
life of any galaxy, or does it represent a genuine class of
galaxies separated from the rest? The qualities assigned to
each galaxy provide a simple way to find unusual objects. Low
quality galaxies are outliers of the classification and,
therefore, abnormal objects that





deserve specific follow-up work. The average spectra of the
classes can be used as templates for redshift determinations.
They represent a unique set comprising all spectral types. This
application of ASK requires extending the template spectra to
the UV, but this upgrade can be done in successive steps as
outlined in § 10.

In order to facilitate these and possibly other applications,
we have made the ASK classification freely available through
the FTP site ftp://ask:galaxy@ftp.iac.es/. We explain how it
can be directly employed in SQL queries that use the CasJobs
facility of SDSS/DR7. We also provide it as ASCII CSV tables
suitable for uses external to SDSS. In addition, the template
spectra are included.



Acknowledgments. We are indebted to J. Betancort, I. G. de la
Rosa, and M. Moles, who contributed discussions in their areas
of expertise. Thanks to an anonymous referee, we include the
discussion on the degree of clustering of the classes at the
end of § 9. The work has been partly funded by the Spanish
Ministries of science, technology and innovation, projects
AYA 2007-67965-03-1, AYA 2007-67752-C03-01, and CSD2006-00070.
Funding for the Sloan Digital Sky Survey (SDSS) and SDSS-II
has been provided by the Alfred P. Sloan Foundation, the
Participating Institutions, the National Science Foundation,
the U.S. Department of Energy, the National Aeronautics and
Space Administration, the Japanese Monbukagakusho, the Max
Planck Society, and the Higher Education Funding Council for
England. The
SDSS is managed by the Astrophysical Research Con- 
sortium (ARC) for the Participating Institutions. The 
Participating Institutions are the American Museum of 
Natural History, Astrophysical Institute Potsdam, Uni- 
versity of Basel, University of Cambridge, Case Western 
Reserve University, The University of Chicago, Drexel 
University, Fermilab, the Institute for Advanced Study, 
the Japan Participation Group, The Johns Hopkins Uni- 
versity, the Joint Institute for Nuclear Astrophysics, the 
Kavli Institute for Particle Astrophysics and Cosmol- 
ogy, the Korean Scientist Group, the Chinese Academy 
of Sciences (LAMOST), Los Alamos National Laboratory,
the Max-Planck-Institute for Astronomy (MPIA),
the Max-Planck-Institute for Astrophysics (MPA), New 
Mexico State University, Ohio State University, Univer- 
sity of Pittsburgh, University of Portsmouth, Princeton 
University, the United States Naval Observatory, and the 
University of Washington. 



APPENDIX 

GALAXIES IN A CLUSTER AS A FUNCTION OF THE CLUSTER CENTER 

The random initialization of k-means leads to small uncertainties in the properties of the clusters which, however, 
produce a significant variation on the actual galaxies assigned to each cluster (§ [21 § H]). This amplification can be 
understood by considering that clusters are defined by regions in a space of many dimensions. A small uncertainty 
in the center of the cluster produces large (relative) variations of the region of space that the cluster samples. Below 
we compute this boost factor under the simplifying assumption that the clusters are defined by hyper-spheres. Such 
simplification should not affect the conclusion we draw since the scaling relationship between volume and area is a 
general property of the space, rather than specific to a particular shape. 

Assume that the galaxies belonging to a class are those within a sphere of radius R centered on the class center. 
Assume that the galaxies are uniformly distributed around this center. Two different runs of the k-means clustering 
algorithm yield slightly different centers for the class, separated by a distance ΔR. How many galaxies will be shared 
by the two classifications? Under the previous hypotheses, it is just the overlapping volume of two n-dimensional 
hyper-spheres of radius R when their centers are separated by ΔR. When ΔR/R ≪ 1, this volume is the volume of one 
of the original hyper-spheres minus the volume of a cylinder of height ΔR whose base is the corresponding 
hyper-disk (i.e., a hyper-sphere in n − 1 dimensions). Using the expression for the volume of a sphere in n dimensions, 
the number of common galaxies, N(ΔR), normalized to the number of galaxies in the class, N(0), turns out to be 

\frac{N(\Delta R)}{N(0)} \simeq 1 - \frac{\Delta R}{R}\,\sqrt{\frac{n}{2\pi}}.    (A1)

Equation (A1) shows how a small relative error in the position of the cluster center gets amplified by a factor √(n/2π) when 
affecting the drop in the number of common galaxies. Since n ≫ 1, the drop is very large. For example, if n = 2000, 
a minute ΔR/R ≃ 2% produces N(ΔR)/N(0) ≃ 65%. Variations induced by cluster radius changes are even more 
dramatic. In this case the boost factor scales with n rather than √n. 
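
The boost factor in equation (A1) can be checked numerically. The sketch below assumes the standard volume of an n-dimensional ball, V_n(R) = π^(n/2) R^n / Γ(n/2 + 1), so that the cylinder-to-sphere volume ratio is (ΔR/R) Γ(n/2 + 1) / [√π Γ((n + 1)/2)], and compares it with the large-n form √(n/2π). The function names are hypothetical; log-gamma is used to avoid overflow at large n.

```python
import math


def base_to_volume_ratio(n):
    """R * V_{n-1}(R) / V_n(R) for n-dimensional balls, which equals
    Gamma(n/2 + 1) / (sqrt(pi) * Gamma((n + 1)/2))."""
    log_ratio = math.lgamma(n / 2 + 1) - math.lgamma((n + 1) / 2)
    return math.exp(log_ratio) / math.sqrt(math.pi)


def shared_fraction(dr_over_r, n):
    """N(dR)/N(0) to first order in dR/R, using the exact volume ratio."""
    return 1.0 - dr_over_r * base_to_volume_ratio(n)


def shared_fraction_large_n(dr_over_r, n):
    """N(dR)/N(0) with the large-n factor sqrt(n / 2 pi) of equation (A1)."""
    return 1.0 - dr_over_r * math.sqrt(n / (2 * math.pi))


if __name__ == "__main__":
    n, dr_over_r = 2000, 0.02
    print(shared_fraction(dr_over_r, n))
    print(shared_fraction_large_n(dr_over_r, n))
```

Both expressions are first-order in ΔR/R, so they are only meaningful while the predicted drop stays well below unity.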



REFERENCES 



Aaronson, M. 1978, ApJ, 221, L103
Abazajian, K. N., Adelman-McCarthy, J. K., Agüeros, M. A., et al. 2009, ApJS, 182, 543
Baldry, I. K., Glazebrook, K., Brinkmann, J., et al. 2004, ApJ, 600, 681
Baldwin, J. A., Phillips, M. M., & Terlevich, R. 1981, PASP, 93, 5
Balogh, M. L., Baldry, I. K., Nichol, R., et al. 2004, ApJ, 615, L101
Bershady, M. A. 1995, AJ, 109, 87
Bershady, M. A., Jangren, A., & Conselice, C. J. 2000, AJ, 119, 2645
Bishop, C. M. 2006, Pattern Recognition and Machine Learning (NY: Springer)
Blanton, M. R. & Roweis, S. 2007, AJ, 133, 734
Bradley, P. S. & Fayyad, U. M. 1998, Refining Initial Points for k-means Clustering, Tech. rep., Microsoft Research, MSR-TR-98-36
Bromley, B. C., Press, W. H., Lin, H., & Kirshner, R. P. 1998, ApJ, 505, 25
Chan, B. H. P., Mitchell, D. A., & Cram, L. E. 2003, MNRAS, 338, 790
Cid Fernandes, R., Mateus, A., Sodre, L., Stasinska, G., & Gomes, J. M. 2005, MNRAS, 358, 363
Connolly, A. J., Szalay, A. S., Bershady, M. A., Kinney, A. L., & Calzetti, D. 1995, AJ, 110, 1071
Conselice, C. J. 2006, MNRAS, 373, 1389






Deeming, T. J. 1964, MNRAS, 127, 493
Eisenstein, D. J., Annis, J., Gunn, J. E., et al. 2001, AJ, 122, 2267
Everitt, B. S. 1995, Cluster Analysis (London: Arnold)
Ferrarese, L. 2006, in Joint Evolution of Black Holes and Galaxies, ed. M. Colpi, V. Gorini, F. Haardt, & U. Moschella (New York: Taylor & Francis), 1
Ferreras, I., Pasquali, A., de Carvalho, R. R., de la Rosa, I. G., & Lahav, O. 2006, MNRAS, 370, 828
Folkes, S., Ronen, S., Price, I., et al. 1999, MNRAS, 308, 459
Folkes, S. R., Lahav, O., & Maddox, S. J. 1996, MNRAS, 283, 651
Formiggini, L. & Brosch, N. 2004, MNRAS, 350, 1067
Francis, P. J., Hewett, P. C., Foltz, C. B., & Chaffee, F. H. 1992, ApJ, 398, 476
Fukugita, M., Nakamura, O., Okamura, S., et al. 2007, AJ, 134, 579
Heavens, A., Panter, B., Jimenez, R., & Dunlop, J. 2004, Nature, 428, 625
Hubble, E. P. 1936, Realm of the Nebulae (New Haven: Yale University Press)
Huertas-Company, M., Rouan, D., Tasca, L., Soucail, G., & Le Fevre, O. 2008, A&A, 478, 971
Huertas-Company, M., et al. 2009, A&A, submitted
Humason, M. L. 1931, ApJ, 74, 35
Jackson, J. C. 1972, MNRAS, 156, 1P
Kannappan, S. J., Guie, J. M., & Baker, A. J. 2009, AJ, 138, 579
Kauffmann, G., Heckman, T. M., Tremonti, C., et al. 2003, MNRAS, 346, 1055
Kennicutt, Jr., R. C. 1998, ARA&A, 36, 189
Kennicutt, Jr., R. C. 1992, ApJS, 79, 255
Le Fevre, O., Vettolani, G., Garilli, B., et al. 2005, A&A, 439, 845
Lilly, S. J., Le Fevre, O., Renzini, A., et al. 2007, ApJS, 172, 70
Lu, H., Zhou, H., Wang, J., et al. 2006, AJ, 131, 790
Madgwick, D. S. 2003, MNRAS, 338, 197
Madgwick, D. S., Coil, A. L., Conselice, C. J., et al. 2003, ApJ, 599, 997



Mittaz, J. P. D., Penston, M. V., & Snijders, M. A. J. 1990, MNRAS, 242, 370
Morgan, W. W. & Mayall, N. U. 1957, PASP, 69, 291
Peña, J. M., Lozano, J. A., & Larrañaga, P. 1999, Pattern Recognition Letters, 20, 1027
Pettini, M. & Pagel, B. E. J. 2004, MNRAS, 348, L59
Reichardt, C., Jimenez, R., & Heavens, A. F. 2001, MNRAS, 327, 849
Salim, S., Rich, R. M., Charlot, S., et al. 2007, ApJS, 173, 267
Sanchez Almeida, J., Aguerri, J. A., Munoz-Tunon, C., & Vazdekis, A. 2009, ApJ, 698, 1497
Sanchez Almeida, J. & Lites, B. W. 2000, ApJ, 532, 1215
Sandage, A. 2005, ARA&A, 43, 581
Schawinski, K., Lintott, C. J., Thomas, D., et al. 2009, ApJ, 690, 1672
Schawinski, K., Thomas, D., Sarzi, M., et al. 2007, MNRAS, 382, 1415
Slonim, N., Somerville, R., Tishby, N., & Lahav, O. 2001, MNRAS, 323, 270
Sodre, L. & Cuevas, H. 1997, MNRAS, 287, 137
Sodre, Jr., L. & Cuevas, H. 1994, Vistas in Astronomy, 38, 287
Stasinska, G. 2004, in Cosmochemistry. The melting pot of the elements, ed. C. Esteban, R. Garcia Lopez, A. Herrero, & F. Sanchez (Cambridge: CUP), 115
Stoughton, C., Lupton, R. H., Bernardi, M., et al. 2002, AJ, 123, 485
Strateva, I., Ivezic, Z., Knapp, G. R., et al. 2001, AJ, 122, 1861
Tojeiro, R., Wilkins, S., Heavens, A. F., Panter, B., & Jimenez, R. 2009, ApJS, 185, 1
Vanderplas, J. & Connolly, A. 2009, AJ, 138, 1365
Whitney, C. A. 1983, A&AS, 51, 443
Worthey, G., Faber, S. M., Gonzalez, J. J., & Burstein, D. 1994, ApJS, 94, 687
Worthey, G. & Ottaviani, D. L. 1997, ApJS, 111, 377
Yip, C. W., Connolly, A. J., Szalay, A. S., et al. 2004, AJ, 128, 585
Zaritsky, D., Zabludoff, A. I., & Willick, J. A. 1995, AJ, 110, 1602



